Introduction to Spark on Hadoop


Page 1: Introduction to Spark on Hadoop

© 2014 MapR Technologies 1

An Overview of Apache Spark

Page 2: Introduction to Spark on Hadoop

© 2014 MapR Technologies 2

Agenda

• MapReduce Refresher

• What is Spark?

• The Difference with Spark

• Examples and Resources

Page 3: Introduction to Spark on Hadoop

© 2014 MapR Technologies 3

MapReduce Refresher

Page 4: Introduction to Spark on Hadoop

© 2014 MapR Technologies 4

MapReduce: A Programming Model

• MapReduce: Simplified Data Processing on Large Clusters (published 2004)

• Parallel and Distributed Algorithm:

• Data Locality

• Fault Tolerance

• Linear Scalability

Page 5: Introduction to Spark on Hadoop

© 2014 MapR Technologies 5

The Hadoop Strategy

http://developer.yahoo.com/hadoop/tutorial/module4.html

Distribute data (share nothing)

Distribute computation (parallelization without synchronization)

Tolerate failures (no single point of failure)

[Diagram: mapping processes run on Nodes 1–3; their output feeds reducing processes on Nodes 1–3]

Page 6: Introduction to Spark on Hadoop

© 2014 MapR Technologies 6

Distribute Data: HDFS

• HDFS splits large data files into chunks (64 MB)
• Chunks are replicated across the cluster
• The NameNode holds the location metadata
• DataNodes store & retrieve the data
• A user process asks the NameNode for metadata over the network, then accesses the physical data directly on the DataNodes

Page 7: Introduction to Spark on Hadoop

© 2014 MapR Technologies 7

Distribute Computation

[Diagram: a MapReduce program is submitted to the Hadoop cluster, which reads from the data sources and produces the result]

Page 8: Introduction to Spark on Hadoop

© 2014 MapR Technologies 8

MapReduce Basics

• Foundational model is based on a distributed file system
  – Scalability and fault tolerance
• Map
  – Loads the data and defines a set of keys
• Many use cases do not need a reduce task
• Reduce
  – Collects the organized key-based data to process and output
• Performance can be tuned based on known details of your source files and cluster shape (size, total number of nodes)

Page 9: Introduction to Spark on Hadoop

© 2014 MapR Technologies 9

MapReduce Execution and Data Flow

[Diagram: on each node, files are loaded from HDFS and split; RecordReaders turn each split into input (k, v) pairs; map tasks emit intermediate (k, v) pairs; a Partitioner and the "shuffle" exchange those pairs between nodes; reduce tasks sort and reduce them; the OutputFormat writes the final (k, v) pairs back to HDFS]

Page 10: Introduction to Spark on Hadoop

© 2014 MapR Technologies 10

MapReduce Example: Word Count

Input:
  "The time has come," the Walrus said,
  "To talk of many things:
  Of shoes—and ships—and sealing-wax
  …

Map (emit (word, 1) for each word):
  the, 1   time, 1   has, 1   come, 1   and, 1   and, 1   …

Shuffle and Sort (group the values by key):
  and, [1, 1, 1]   come, [1, 1, 1]   has, [1, 1]   the, [1, 1, 1]   time, [1, 1, 1, 1]   …

Reduce (sum the values for each key) and Output:
  and, 12   come, 6   has, 8   the, 4   time, 14   …

Page 11: Introduction to Spark on Hadoop

© 2014 MapR Technologies 11

Tolerate Failures

Failures are expected & managed gracefully:

• DataNode fails -> the NameNode will locate a replica
• MapReduce task fails -> the JobTracker will schedule another one

Page 12: Introduction to Spark on Hadoop

© 2014 MapR Technologies 12

MapReduce Processing Model

• Define mappers

• Shuffling is automatic

• Define reducers

• For complex work, chain jobs together

– Use a higher level language or DSL that does this for you

Page 13: Introduction to Spark on Hadoop

© 2014 MapR Technologies 13

MapReduce Design Patterns

• Summarization – Inverted index, counting

• Filtering – Top ten, distinct

• Aggregation

• Data Organization – partitioning

• Join – Join data sets

• Metapattern – Job chaining

Page 14: Introduction to Spark on Hadoop

© 2014 MapR Technologies 14

Inverted Index Example

alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it

Map output:
  time, alice.txt   has, alice.txt   come, alice.txt   …
  tis, macbeth.txt   time, macbeth.txt   do, macbeth.txt   …

Reduce output (word -> list of files):
  come, (alice.txt)   do, (macbeth.txt)   has, (alice.txt)   time, (alice.txt, macbeth.txt)   …

Page 15: Introduction to Spark on Hadoop

© 2014 MapR Technologies 15

MapReduce Example: Inverted Index

• Input: (filename, text) records

• Output: list of files containing each word

• Map: foreach word in text.split(): output(word, filename)

• Combine: uniquify filenames for each word

• Reduce: def reduce(word, filenames): output(word, sort(filenames))
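To make the model concrete, here is a minimal single-process sketch of the same map/reduce logic written with plain Scala collections, not the Hadoop MapReduce API; the file names and text are illustrative only.

// Single-process sketch of the inverted-index map/shuffle/reduce logic above.
object InvertedIndexSketch {
  // Map: for each word in the text, emit (word, filename)
  def mapRecord(filename: String, text: String): Seq[(String, String)] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).map(word => (word, filename)).toSeq

  // Reduce: collect the unique, sorted file names for each word
  def reduceWord(word: String, filenames: Seq[String]): (String, Seq[String]) =
    (word, filenames.distinct.sorted)

  def main(args: Array[String]): Unit = {
    val records = Seq(
      "alice.txt"   -> "The time has come, the Walrus said",
      "macbeth.txt" -> "tis time to do it"
    )
    val index = records
      .flatMap { case (file, text) => mapRecord(file, text) } // map phase
      .groupBy(_._1)                                          // "shuffle": group by word
      .map { case (word, pairs) => reduceWord(word, pairs.map(_._2)) }
    index.toSeq.sortBy(_._1).foreach(println)
  }
}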

Page 16: Introduction to Spark on Hadoop

© 2014 MapR Technologies 18

MapReduce: The Good

• Built in fault tolerance

• Optimized IO path

• Scalable

• Developer focuses on Map/Reduce, not infrastructure

• Simple(?) API

Page 17: Introduction to Spark on Hadoop

© 2014 MapR Technologies 19

MapReduce: The Bad

• Optimized for disk IO

– Doesn’t leverage memory well

– Iterative algorithms go through disk IO path again and again

• Primitive API
  – Simple abstraction
  – Key/Value in and out
  – Basic things like joins require extensive code
• Results are often many files that need to be combined appropriately

Page 19: Introduction to Spark on Hadoop

© 2014 MapR Technologies 21

What is Hive?

• Data Warehouse on top of Hadoop

– Gives ability to query without programming

– Used for analytical querying of data

• SQL-like execution for Hadoop

• SQL evaluates to MapReduce code

– Submits jobs to your cluster

Page 20: Introduction to Spark on Hadoop

© 2014 MapR Technologies 22

Using HBase as a MapReduce/Hive Source

EXAMPLE: data warehouse for analytical processing queries

[Diagram: a Hive SELECT … JOIN runs as a MapReduce application that reads from the HBase database and from files (HDFS/MapR-FS) and produces a query result file]

Page 21: Introduction to Spark on Hadoop

© 2014 MapR Technologies 23

Using HBase as a MapReduce or Hive Sink

EXAMPLE: bulk load data into a table

[Diagram: a Hive INSERT … SELECT runs as a MapReduce application that reads files (HDFS/MapR-FS) and writes into the HBase database]

Page 22: Introduction to Spark on Hadoop

© 2014 MapR Technologies 24

Using HBase as a Source & Sink

EXAMPLE: calculate and store summaries (a pre-computed, materialized view)

[Diagram: a Hive SELECT … JOIN runs as a MapReduce application that both reads from and writes to the HBase database]

Page 23: Introduction to Spark on Hadoop

© 2014 MapR Technologies 25

[Diagram: Hive architecture. Client interfaces (command line interface, web interface, JDBC and ODBC via the Thrift Server) talk to the Hive Driver (compiler, optimizer, executor), which uses the Metastore and submits jobs to Hadoop (MapReduce + HDFS: JobTracker, NameNode, DataNodes + TaskTrackers)]

The schema metadata is stored in the Hive metastore.

[Diagram: a Hive table definition maps to the HBase trades_tall table]

Page 24: Introduction to Spark on Hadoop

© 2014 MapR Technologies 26

Hive HBase

[Diagram: Hive metastore table definitions either point to existing HBase tables (external) or are Hive managed]

Page 25: Introduction to Spark on Hadoop

© 2014 MapR Technologies 27

Hive HBase – External Table

CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")
TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");

[Diagram: the Hive table definition "trades" (key string, price bigint, vol bigint) points to the external HBase table "/usr/user1/trades_tall" with columns key, cf1:price, cf1:vol; for example, row AMZN_986186008 has price 12.34 and vol 1000, and row AMZN_986186007 has price 12.00 and vol 50]

Page 26: Introduction to Spark on Hadoop

© 2014 MapR Technologies 28

Hive HBase – Hive Query

SQL evaluates to MapReduce code

SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

[Diagram: queries pass through the parser, planner, and execution stages and run against the HBase tables]

Page 27: Introduction to Spark on Hadoop

© 2014 MapR Technologies 29

Hive HBase – External Table

SQL evaluates to MapReduce code:

SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%";

• Selection: WHERE key LIKE …
• Projection: SELECT price
• Aggregation: AVG(price)

[HBase table sample: key, cf1:price, cf1:vol; AMZN_986186008, 12.34, 1000; AMZN_986186007, 12.00, 50]

Page 28: Introduction to Spark on Hadoop

© 2014 MapR Technologies 30

Hive Query Plan

• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

Filter Operator

predicate: (key like 'GOOG%') (type: boolean)

Select Operator

Group By Operator
Reduce Operator Tree:

Group By Operator

Select Operator

File Output Operator

Page 29: Introduction to Spark on Hadoop

© 2014 MapR Technologies 31

Hive Query Plan – (2)

hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

[Diagram: the trades table is scanned by parallel map() tasks that filter (key like 'GOOG%') and project the price column; reduce() tasks group and apply the avg(price) aggregation to produce the output column col0]

Page 30: Introduction to Spark on Hadoop

© 2014 MapR Technologies 32

Hive Map Reduce

[Diagram: a Hive SELECT … JOIN query compiles to Map() tasks that scan the HBase regions (key, row), a shuffle, and reduce() tasks whose results are written to the query result file]

Page 31: Introduction to Spark on Hadoop

© 2014 MapR Technologies 33

Some Hive Design Patterns

• Summarization

  – SELECT min(delay), max(delay), count(*) FROM flights GROUP BY carrier;

• Filtering

  – SELECT * FROM trades WHERE key LIKE "GOOG%";
  – SELECT price FROM trades ORDER BY price DESC LIMIT 10;

• Join

  – SELECT tableA.field1, tableB.field2 FROM tableA
    JOIN tableB ON tableA.field1 = tableB.field2;

Page 32: Introduction to Spark on Hadoop

© 2014 MapR Technologies 34

What is a Directed Acyclic Graph (DAG)?

• Graph

– vertices (points) and edges (lines)

• Directed

– Only in a single direction

• Acyclic

– No looping

• This supports fault-tolerance


Page 33: Introduction to Spark on Hadoop

© 2014 MapR Technologies 35

Hive Query Plan Map Reduce Execution

[Diagram: the operator tree (table scans of t1, reduce sinks RS1–RS4, aggregations AGG1 and AGG2, JOIN1, file sink FS1) initially maps to three MapReduce jobs; after optimization the same tree runs as fewer jobs]

Page 34: Introduction to Spark on Hadoop

© 2014 MapR Technologies 36

Iteration: the bane of MapReduce (slow!)

Page 35: Introduction to Spark on Hadoop

© 2014 MapR Technologies 37

Typical MapReduce Workflows

[Diagram: Job 1 reads its input from HDFS and its maps and reduces write a SequenceFile; Job 2 reads that SequenceFile and writes another; the chain continues until the last job writes its final output to HDFS. Every handoff between jobs goes through HDFS.]

Page 36: Introduction to Spark on Hadoop

© 2014 MapR Technologies 38

Iterations

[Diagram: an iterative computation as a chain of steps over the same data]

In-memory Caching

• Data Partitions read from RAM instead of disk
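As a hedged illustration of why this matters, the sketch below uses the plain Spark Scala API in local mode with synthetic data and a made-up update rule: the RDD is cached once, so after the first pass every iteration reads its partitions from memory instead of re-reading the source.

import org.apache.spark.{SparkConf, SparkContext}

object CachingIterationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))
    // Stand-in for a large file on HDFS; cached once, reused by every iteration.
    val data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    var estimate = 0.0
    for (step <- 1 to 5) {
      // After the first pass these scans read cached partitions from RAM,
      // not the original source. The update rule itself is made up.
      val delta = data.map(x => x - estimate).reduce(_ + _) / data.count()
      estimate += delta / 2
      println(s"step $step, estimate = $estimate")
    }
    sc.stop()
  }
}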

Page 38: Introduction to Spark on Hadoop

© 2014 MapR Technologies 40

Lab – Query HBase airline data with Hive

Import mapping to row key and columns:

Row key: Carrier-FlightNumber-Date-Origin-Destination
Column families: delay, info, stats, timing
Columns: aircraft delay, arr delay, carrier delay, cncl, cncl code, tailnum, distance, elaptime, arrtime, dep time, …

Example row: AA-1-2014-01-01-JFK-LAX with values 13, 0, N7704, 2475, 385.00, 359, …

Page 39: Introduction to Spark on Hadoop

© 2014 MapR Technologies 41

Count number of cancellations by reason (code)

$ hive
hive> explain select count(*) as cancellations, cnclcode from flighttable
      where cncl=1 group by cnclcode order by cancellations asc limit 100;
OK

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

Filter Operator

Select Operator

Group By Operator

aggregations: count()

Reduce Output Operator

Reduce Operator Tree:

Group By Operator

aggregations: count(VALUE._col0)

Select Operator

File Output Operator

Stage: Stage-2

Map Reduce

Map Operator Tree:

TableScan

Reduce Output Operator

Reduce Operator Tree:

Extract

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

Limit

File Output Operator

Stage: Stage-0

Fetch Operator

limit: 100

Page 40: Introduction to Spark on Hadoop

© 2014 MapR Technologies 42

2 MapReduce jobs

$ hive
hive> select count(*) as cancellations, cnclcode from flighttable
      where cncl=1 group by cnclcode order by cancellations asc limit 100;

Total jobs = 2

MapReduce Jobs Launched:

Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0

MAPRFS Write: 0 SUCCESS

Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0

MAPRFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 14 seconds 820 msec

OK

4598 C

7146 A

Page 41: Introduction to Spark on Hadoop

© 2014 MapR Technologies 43

Find the longest airline delays

$ hive
hive> select arrdelay, key from flighttable where arrdelay > 1000
      order by arrdelay desc limit 10;

MapReduce Jobs Launched:

Map: 1 Reduce: 1

OK

1530.0 AA-385-2014-01-18-BNA-DFW

1504.0 AA-1202-2014-01-15-ONT-DFW

1473.0 AA-1265-2014-01-05-CMH-LAX

1448.0 AA-1243-2014-01-21-IAD-DFW

1390.0 AA-1198-2014-01-11-PSP-DFW

1335.0 AA-1680-2014-01-21-SLC-DFW

1296.0 AA-1277-2014-01-21-BWI-DFW

1294.0 MQ-2894-2014-01-02-CVG-DFW

1201.0 MQ-3756-2014-01-01-CLT-MIA

1184.0 DL-2478-2014-01-10-BOS-ATL

Page 42: Introduction to Spark on Hadoop

© 2014 MapR Technologies 44

Apache Spark

Page 43: Introduction to Spark on Hadoop

© 2014 MapR Technologies 45

Apache Spark

spark.apache.org

github.com/apache/spark


• Originally developed in 2009 in UC Berkeley's AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation

Page 44: Introduction to Spark on Hadoop

© 2014 MapR Technologies 46

Spark: Fast Big Data

• Fast to Write
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to Run
  – General execution graphs
  – In-memory storage
• 2-5× less code

Page 45: Introduction to Spark on Hadoop

© 2014 MapR Technologies 47

The Spark Community

Page 46: Introduction to Spark on Hadoop

© 2014 MapR Technologies 48

Spark is the Most Active Open Source Project in Big Data

[Chart: number of project contributors in the past year for Spark, Giraph, Storm, and Tez (y-axis 0–140); Spark has by far the most contributors]

Page 47: Introduction to Spark on Hadoop

© 2014 MapR Technologies 49

Unified Platform

[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine; Spark runs on cluster managers such as Mesos or Hadoop YARN, over a distributed file system (HDFS, MapR-FS, S3, …)]

Page 48: Introduction to Spark on Hadoop

© 2014 MapR Technologies 50

Spark Use Cases

• Iterative algorithms on large amounts of data

• Anomaly detection

• Classification

• Predictions

• Recommendations

Page 49: Introduction to Spark on Hadoop

© 2014 MapR Technologies 51

Why Iterative Algorithms

• Algorithms that need iterations

– Clustering (K-Means, Canopy, …)
– Gradient descent (e.g., Logistic Regression, Matrix Factorization)
– Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality)
– Alternating Least Squares (ALS)
– Graph communities / dense sub-components
– Inference (belief propagation)
– …

Page 50: Introduction to Spark on Hadoop

© 2014 MapR Technologies 52

Example: Logistic Regression

• Goal: find best line separating two sets of points

target

random initial line

Page 51: Introduction to Spark on Hadoop

© 2014 MapR Technologies 53

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Iteration!

Logistic Regression

Page 52: Introduction to Spark on Hadoop

© 2014 MapR Technologies 54

Data Sources

• Local Files

– file:///opt/httpd/logs/access_log

• S3

• Hadoop Distributed Filesystem

– Regular files, sequence files, any other Hadoop InputFormat

• HBase

• Other NoSQL data stores
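A hedged sketch of reading several of these sources with the same SparkContext API (all paths and bucket names are placeholders; assumes a Spark 1.x-era spark-shell session where sc already exists):

val localFile = sc.textFile("file:///opt/httpd/logs/access_log")          // local file
val hdfsFile  = sc.textFile("hdfs:///user/data/events.log")               // HDFS / MapR-FS
val s3File    = sc.textFile("s3n://my-bucket/logs/")                      // S3 via the s3n scheme
val seqFile   = sc.sequenceFile[String, Int]("hdfs:///user/data/counts")  // Hadoop SequenceFile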

Page 53: Introduction to Spark on Hadoop

© 2014 MapR Technologies 55

How Spark Works

Page 54: Introduction to Spark on Hadoop

© 2014 MapR Technologies 56

Spark Programming Model

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map(…)

[Diagram: the Driver Program creates the SparkContext, which connects to the cluster and runs tasks on executors on the Worker Nodes]
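A minimal, hedged driver-program skeleton matching this model (the input path is a placeholder; the application would normally be submitted with spark-submit, which supplies the master URL):

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-sketch")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///path/to/input")   // distributed dataset
    val lineLengths = rdd.map(_.length)              // transformation, evaluated lazily
    println(lineLengths.reduce(_ + _))               // action: triggers the actual job on the workers

    sc.stop()
  }
}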

Page 55: Introduction to Spark on Hadoop

© 2014 MapR Technologies 57

Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:

• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory, or on disk

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
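A small sketch of these properties with the standard RDD API (assumes a spark-shell session where sc already exists; the log path is a placeholder):

import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("hdfs:///logs/access_log")        // partitioned across the cluster
val errors = lines.filter(_.contains("ERROR"))             // derived RDD; the original is unchanged (read-only)
errors.persist(StorageLevel.MEMORY_AND_DISK)               // cache in memory, spill to disk if it does not fit
println(errors.count())                                    // computed in parallel; later actions reuse the cache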

Page 56: Introduction to Spark on Hadoop

© 2014 MapR Technologies 58

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

Page 57: Introduction to Spark on Hadoop

© 2014 MapR Technologies 59

Working With RDDs

RDD RDD RDD RDD

Transformations

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

textFile = sc.textFile("SomeFile.txt")

Page 58: Introduction to Spark on Hadoop

© 2014 MapR Technologies 60

Working With RDDs

RDD RDD RDD RDD

Transformations

Action Value

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()

74

linesWithSpark.first()

# Apache Spark

textFile = sc.textFile("SomeFile.txt")

Page 59: Introduction to Spark on Hadoop

© 2014 MapR Technologies 61

MapR Tutorial: Getting Started with Spark on MapR Sandbox

• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial

Page 60: Introduction to Spark on Hadoop

© 2014 MapR Technologies 62

Example Spark Word Count in Java

Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax …

[Diagram: each word becomes a (word, 1) pair, e.g. (the, 1), (time, 1), (and, 1); the pairs are reduced by key to counts such as (and, 12), (time, 4), (the, 20)]

JavaRDD<String> input = sc.textFile(inputFile);

// Split each line into words

JavaRDD<String> words = input.flatMap(

new FlatMapFunction<String, String>() {

public Iterable<String> call(String x) {

return Arrays.asList(x.split(" "));

}});

// Turn the words into (word, 1) pairs

JavaPairRDD<String, Integer> word1s = words.mapToPair(

new PairFunction<String, String, Integer>(){

public Tuple2<String, Integer> call(String x){

return new Tuple2(x, 1);

}});

// reduceByKey: add the pair values by key to produce counts

JavaPairRDD<String, Integer> counts =word1s.reduceByKey(

new Function2<Integer, Integer, Integer>(){

public Integer call(Integer x, Integer y){

return x + y;

}});


Page 61: Introduction to Spark on Hadoop

© 2014 MapR Technologies 63

Example Spark Word Count in Scala

Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax …

[Diagram: each word becomes a (word, 1) pair, e.g. (the, 1), (time, 1), (and, 1); the pairs are reduced by key to counts such as (and, 12), (time, 4), (the, 20)]

// Load our input data.

val input = sc.textFile(inputFile)

// Split it up into words.

val words = input.flatMap(line => line.split(" "))

// Transform into pairs and count.

val counts = words

.map(word => (word, 1))

.reduceByKey{case (x, y) => x + y}

// Save the word count back out to a text file,

counts.saveAsTextFile(outputFile)

Output: (the, 20), (time, 4), …, (and, 12)

Page 62: Introduction to Spark on Hadoop

© 2014 MapR Technologies 64

Example Spark Word Count in Scala

// Load input data.
val input = sc.textFile(inputFile)

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD, split into partitions]

Page 63: Introduction to Spark on Hadoop

© 2014 MapR Technologies 65

Example Spark Word Count in Scala

// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD]

Page 64: Introduction to Spark on Hadoop

© 2014 MapR Technologies 66

FlatMap

flatMap(line => line.split(" "))

One-to-many mapping: each line produces zero or more words, e.g. "Ships and wax" -> Ships, and, wax

Result: JavaRDD<String> words

Page 65: Introduction to Spark on Hadoop

© 2014 MapR Technologies 67

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD]

Page 66: Introduction to Spark on Hadoop

© 2014 MapR Technologies 68

Map

map(word => (word, 1))

One-to-one mapping: each word becomes a (word, 1) pair, e.g. and -> (and, 1)

Result: JavaPairRDD<String, Integer> word1s

Page 67: Introduction to Spark on Hadoop

© 2014 MapR Technologies 69

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
  .map(word => (word, 1))
  .reduceByKey{case (x, y) => x + y}

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD]

Page 68: Introduction to Spark on Hadoop

© 2014 MapR Technologies 70

reduceByKey

reduceByKey{case (x, y) => x + y}

Values with the same key are combined, e.g. (and, 1), (and, 1), (wax, 1) -> (and, 2), (wax, 1)

Result: JavaPairRDD<String, Integer> counts

Page 69: Introduction to Spark on Hadoop

© 2014 MapR Technologies 71

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
  .map(word => (word, 1))
  .reduceByKey{case (x, y) => x + y}
val countArray = counts.collect()

[RDD lineage: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD; collect -> Array on the driver]

Page 70: Introduction to Spark on Hadoop

© 2014 MapR Technologies 72

Components Of Execution

Page 71: Introduction to Spark on Hadoop

© 2014 MapR Technologies 73

MapR Blog: Getting Started with the Spark Web UI

• https://www.mapr.com/blog/getting-started-spark-web-ui

Page 72: Introduction to Spark on Hadoop

© 2014 MapR Technologies 74

Spark RDD DAG -> Physical Execution plan

RDD Graph:
  sc.textFile(…) -> HadoopRDD
  flatMap -> MapPartitionsRDD
  map -> MapPartitionsRDD
  reduceByKey -> ShuffledRDD -> MapPartitionsRDD
  collect

Physical Plan: the graph is split into Stage 1 (up to the shuffle) and Stage 2 (after the shuffle)

Page 73: Introduction to Spark on Hadoop

© 2014 MapR Technologies 75

Physical Execution plan -> Stages and Tasks

[Diagram: the DAG's physical plan is split into Stage 1 and Stage 2; each stage is split into a set of tasks, one per partition. The Task Scheduler sends task sets to executors on the worker nodes, where each task runs in a thread against a cached partition, reading its HDFS blocks from the co-located data node]

Page 74: Introduction to Spark on Hadoop

© 2014 MapR Technologies 76

Summary of Components

• Task : unit of execution

• Stage: Group of Tasks

  – Based on partitions of the RDD
  – Tasks run in parallel

• DAG: Logical graph of RDD operations

• RDD: Parallel dataset with partitions

Page 75: Introduction to Spark on Hadoop

© 2014 MapR Technologies 77

How Spark Application runs on a Hadoop cluster

[Diagram: the Driver Program on the client node creates the SparkContext (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.map). The SparkContext requests resources from the YARN Resource Manager; the YARN Node Manager on each worker node launches an executor; each executor runs tasks against cached partitions, reading HDFS blocks from the local data node. ZooKeeper appears in the diagram coordinating the cluster services.]

Page 76: Introduction to Spark on Hadoop

© 2014 MapR Technologies 78

Deploying Spark – Cluster Manager Types

• Standalone mode

• Mesos

• YARN

• EC2

• GCE
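The cluster manager is chosen through the master URL handed to Spark; a hedged sketch (host names and ports are examples only, using Spark 1.x master strings):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-app")
  .setMaster("local[*]")                 // no cluster manager: run locally on all cores
//.setMaster("spark://master:7077")      // standalone cluster manager
//.setMaster("mesos://master:5050")      // Mesos
//.setMaster("yarn-client")              // YARN (Spark 1.x master string)
val sc = new SparkContext(conf)

On EC2 or GCE the launch scripts typically stand up a standalone cluster, so the spark:// form applies there.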

Page 77: Introduction to Spark on Hadoop

© 2014 MapR Technologies 79

Example: Log Mining

Page 78: Introduction to Spark on Hadoop

© 2014 MapR Technologies 80

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Based on slides from Pat McDonough at

Page 79: Introduction to Spark on Hadoop

© 2014 MapR Technologies 81

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

Page 80: Introduction to Spark on Hadoop

© 2014 MapR Technologies 82

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

Page 81: Introduction to Spark on Hadoop

© 2014 MapR Technologies 83

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

Base RDD

Page 82: Introduction to Spark on Hadoop

© 2014 MapR Technologies 84

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

Page 83: Introduction to Spark on Hadoop

© 2014 MapR Technologies 85

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

Transformed RDD

Page 84: Introduction to Spark on Hadoop

© 2014 MapR Technologies 86

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Page 85: Introduction to Spark on Hadoop

© 2014 MapR Technologies 87

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count() Action

Page 86: Introduction to Spark on Hadoop

© 2014 MapR Technologies 88

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Page 87: Introduction to Spark on Hadoop

© 2014 MapR Technologies 89

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver tasks

tasks

tasks

Page 88: Introduction to Spark on Hadoop

© 2014 MapR Technologies 90

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Read

HDFS

Block

Read

HDFS

Block

Read

HDFS

Block

Page 89: Introduction to Spark on Hadoop

© 2014 MapR Technologies 91

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

Process

& Cache

Data

Process

& Cache

Data

Process

& Cache

Data

Page 90: Introduction to Spark on Hadoop

© 2014 MapR Technologies 92

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

results

results

results

Page 91: Introduction to Spark on Hadoop

© 2014 MapR Technologies 93

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Page 92: Introduction to Spark on Hadoop

© 2014 MapR Technologies 94

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

tasks

tasks

tasks

Driver

Page 93: Introduction to Spark on Hadoop

© 2014 MapR Technologies 95

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

Process

from

Cache

Process

from

Cache

Process

from

Cache

Page 94: Introduction to Spark on Hadoop

© 2014 MapR Technologies 96

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

results

results

results

Page 95: Introduction to Spark on Hadoop

© 2014 MapR Technologies 97

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

Cache your data -> Faster results

Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on disk

Page 96: Introduction to Spark on Hadoop

© 2014 MapR Technologies 98

Transformations and Actions

Page 97: Introduction to Spark on Hadoop

© 2014 MapR Technologies 99

RDD Transformations and Actions

[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a value to the driver):
  reduce, collect, count, save, lookupKey, …
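A hedged Scala sketch exercising a few of the listed operations (assumes a spark-shell session where sc exists; the data is made up):

val a = sc.parallelize(Seq(("a", 1), ("b", 2)))
val b = sc.parallelize(Seq(("a", 10), ("c", 30)))

val unioned = a.union(b)                  // transformation: defines a new RDD, nothing runs yet
val summed  = unioned.reduceByKey(_ + _)  // transformation with a shuffle
val joined  = a.join(b)                   // transformation: ("a", (1, 10))

println(summed.collect().toSeq)           // action: brings the values back to the driver
println(joined.count())                   // action: returns a number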

Page 98: Introduction to Spark on Hadoop

© 2014 MapR Technologies 100

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)            # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)    # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))               # => {0, 0, 1, 0, 1, 2}

range(x) is a sequence of numbers 0, 1, …, x-1

Page 99: Introduction to Spark on Hadoop

© 2014 MapR Technologies 101

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()                       # => [1, 2, 3]

# Return first K elements
> nums.take(2)                         # => [1, 2]

# Count number of elements
> nums.count()                         # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)      # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Page 100: Introduction to Spark on Hadoop

© 2014 MapR Technologies 102

RDD Fault Recovery

• RDDs track lineage information
• Lineage can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS File -> filter (func = startswith(…)) -> Filtered RDD -> map (func = split(…)) -> Mapped RDD]
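The lineage Spark keeps for recovery can be inspected directly; a hedged Scala-shell sketch of the same pipeline (the path is a placeholder):

val msgs = sc.textFile("hdfs:///logs/app.log")
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))
println(msgs.toDebugString)   // prints the HadoopRDD -> filtered -> mapped chain Spark would replay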

Page 101: Introduction to Spark on Hadoop

© 2014 MapR Technologies 103

Passing a function to Spark

• Spark's Scala API is based on anonymous function syntax
  – (x: Int) => x * x
• which is shorthand for:
  new Function1[Int, Int] {
    def apply(x: Int) = x * x
  }
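A hedged sketch of equivalent ways to pass the same function (assumes spark-shell, made-up data; the explicit Function1 form adds Serializable so the task can be shipped to executors):

val nums = sc.parallelize(1 to 5)

val squaresA = nums.map(x => x * x)                          // anonymous function literal
def square(x: Int): Int = x * x
val squaresB = nums.map(square)                              // named function
val squaresC = nums.map(new Function1[Int, Int] with Serializable {
  def apply(x: Int): Int = x * x                             // the desugared form shown above
})
println(squaresC.collect().toSeq)                            // Vector(1, 4, 9, 16, 25)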

Page 102: Introduction to Spark on Hadoop

© 2014 MapR Technologies 104

Dataframes

Page 103: Introduction to Spark on Hadoop

© 2014 MapR Technologies 105

DataFrame

Distributed collection of data organized into named columns.

// Create the DataFrame
val df = sqlContext.read.json("person.json")

// Print the schema in a tree format

df.printSchema()

// root

// |-- age: long (nullable = true)

// |-- name: string (nullable = true)

// |-- height: string (nullable = true)

// Select only the "name" column

df.select("name").show()

https://spark.apache.org/docs/latest/sql-programming-guide.html
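Continuing the sketch above on the same df (hedged; assumes the person.json schema shown), a column-expression filter, a grouped count, and a computed column:

df.filter(df("age") > 21).show()              // rows with age > 21
df.groupBy("age").count().show()              // number of people per age
df.select(df("name"), df("age") + 1).show()   // computed column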

Page 104: Introduction to Spark on Hadoop

© 2014 MapR Technologies 106

DataFrame RDD

• # DataFrame style
  lineitems.groupBy("customer").agg(Map(
    "units" -> "avg",
    "totalPrice" -> "std"
  ))

• # or SQL style
  SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
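A hedged Scala sketch of the SQL style with the Spark 1.x API (assumes lineitems is a DataFrame; the table and column names follow the example above and are assumptions):

lineitems.registerTempTable("lineitems")
val result = sqlContext.sql(
  "SELECT customer, AVG(units) AS avg_units FROM lineitems GROUP BY customer")
result.show()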

Page 105: Introduction to Spark on Hadoop

© 2014 MapR Technologies 107

Demo Interactive Shell

• Iterative Development

– Cache those RDDs

– Open the shell and ask questions

• We have all wished we could do this with MapReduce

– Compile / save your code for scheduled jobs later

• Scala – spark-shell

• Python – pyspark
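An example of that interactive workflow (hedged; launched with spark-shell, which already provides sc; the log path is a placeholder):

val logs = sc.textFile("/opt/httpd/logs/access_log").cache()   // cache once
logs.filter(_.contains(" 404 ")).count()                        // ask a question
logs.filter(_.contains(" 500 ")).count()                        // ask another; reuses the cached partitions
logs.take(5).foreach(println)                                   // peek at a few raw lines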

Page 106: Introduction to Spark on Hadoop

© 2014 MapR Technologies 108

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data

• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data

Page 107: Introduction to Spark on Hadoop

© 2014 MapR Technologies 109

The physical plan for DataFrames

Page 108: Introduction to Spark on Hadoop

© 2014 MapR Technologies 110

DataFrame Execution Plan

// Print the physical plan to the console
auction.select("auctionid").distinct.explain()

== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [auctionid#0], 200)
  Distinct true
   Project [auctionid#0]
    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37

Page 109: Introduction to Spark on Hadoop

© 2014 MapR Technologies 111

There’s a lot more !

Page 110: Introduction to Spark on Hadoop

© 2014 MapR Technologies 112

Unified Platform

[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine; Spark runs on cluster managers such as Mesos or Hadoop YARN, over a distributed file system (HDFS, MapR-FS, S3, …)]

Page 111: Introduction to Spark on Hadoop

© 2014 MapR Technologies 113

Soon to Come

• Spark On-Demand Training
  – https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
  – Movie Recommendations with Collaborative Filtering
  – Spark Streaming

Page 112: Introduction to Spark on Hadoop

© 2014 MapR Technologies 114

Soon to Come

Blogs and Tutorials:
  – Re-write this Mahout example with Spark

Page 113: Introduction to Spark on Hadoop

© 2014 MapR Technologies 115

Examples and Resources

Page 114: Introduction to Spark on Hadoop

© 2014 MapR Technologies 116

Spark on MapR

• Certified Spark Distribution

• Fully supported and packaged by MapR in partnership with

Databricks

– mapr-spark package with Spark, Shark, Spark Streaming today

– Spark-python, GraphX and MLLib soon

• YARN integration

– Spark can then allocate resources from the cluster when needed

Page 116: Introduction to Spark on Hadoop

© 2014 MapR Technologies 118

Q & A

@mapr maprtech


Engage with us!

MapR

maprtech

mapr-technologies