Introduction to Spark on Hadoop


© 2014 MapR Technologies 1 © 2014 MapR Technologies

An Overview of Apache Spark

© 2014 MapR Technologies 2

Agenda

• MapReduce Refresher

• What is Spark?

• The Difference with Spark

• Examples and Resources

© 2014 MapR Technologies 3 © 2014 MapR Technologies

MapReduce Refresher

© 2014 MapR Technologies 4

MapReduce: A Programming Model

• MapReduce: Simplified Data Processing on Large Clusters (published 2004)

• Parallel and Distributed Algorithm:

• Data Locality

• Fault Tolerance

• Linear Scalability

© 2014 MapR Technologies 5

The Hadoop Strategy

http://developer.yahoo.com/hadoop/tutorial/module4.html

Distribute data (share nothing)

Distribute computation (parallelization without synchronization)

Tolerate failures (no single point of failure)

Diagram: mapping processes run in parallel on Node 1, Node 2, and Node 3, followed by reducing processes on Node 1, Node 2, and Node 3.

© 2014 MapR Technologies 6

Distribute Data: HDFS

HDFS splits large data files into chunks (64 MB), and the chunks are replicated across the cluster. The NameNode keeps the location metadata; the DataNodes store and retrieve the data. A user process asks the NameNode for metadata over the network, then accesses the physical data directly on the DataNodes.

© 2014 MapR Technologies 7

Distribute Computation

Diagram: a MapReduce program runs across the Hadoop cluster, reading from the data sources and producing a result.

© 2014 MapR Technologies 8

MapReduce Basics

• Foundational model is based on a distributed file system

– Scalability and fault-tolerance

• Map

– Loading of the data and defining a set of keys

• Many use cases do not utilize a reduce task

• Reduce

– Collects the organized key-based data to process and output

• Performance can be tweaked based on known details of your

source files and cluster shape (size, total number)

© 2014 MapR Technologies 9

MapReduce Execution and Data Flow

Diagram: on each node, files loaded from HDFS are divided into splits; RecordReaders turn each split into input (k, v) pairs, which are processed by map tasks. A Partitioner assigns the intermediate (k, v) pairs to reducers, and the "shuffle" exchanges those pairs among all nodes. Each reducer sorts its input, runs reduce, and an OutputFormat writes the final (k, v) pairs back to the local HDFS store.

© 2014 MapR Technologies 10

MapReduce Example: Word Count

Input:

"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax

Map output, one (word, 1) pair per word: (the, 1), (time, 1), (has, 1), (come, 1), (and, 1), (and, 1), …

Shuffle and sort, grouping values by key: and, [1, 1, 1]; come, [1, 1, 1]; has, [1, 1]; the, [1, 1, 1]; time, [1, 1, 1, 1]

Reduce output, summing each list: and, 3; come, 3; has, 2; the, 3; time, 4
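To make the mapper and reducer roles concrete, here is a minimal sketch of the word-count logic as plain Scala functions; emit is a stand-in for the framework's output collector and is not part of any real API.

// Hypothetical sketch: emit stands in for the MapReduce output collector.
def wcMap(docId: String, text: String)(emit: (String, Int) => Unit): Unit =
  text.split(" ").foreach(word => emit(word, 1))

// After the shuffle groups values by key, the reducer sums them.
def wcReduce(word: String, counts: Iterable[Int]): (String, Int) =
  (word, counts.sum)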

© 2014 MapR Technologies 11

Tolerate Failures

Hadoop Cluster

Failures are expected & managed gracefully

DataNode fails -> the NameNode locates a replica

MapReduce task fails -> the JobTracker schedules another one

Data

© 2014 MapR Technologies 12

MapReduce Processing Model

• Define mappers

• Shuffling is automatic

• Define reducers

• For complex work, chain jobs together

– Use a higher level language or DSL that does this for you

© 2014 MapR Technologies 13

MapReduce Design Patterns

• Summarization – Inverted index, counting

• Filtering – Top ten, distinct

• Aggregation

• Data Organization – partitioning

• Join – Join data sets

• Metapattern – Job chaining

© 2014 MapR Technologies 14

Inverted Index Example

Input:

alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it

Map output: time, alice.txt; has, alice.txt; come, alice.txt; …; tis, macbeth.txt; time, macbeth.txt; do, macbeth.txt; …

Reduce output: come, (alice.txt); do, (macbeth.txt); has, (alice.txt); time, (alice.txt, macbeth.txt); …

© 2014 MapR Technologies 15

MapReduce Example: Inverted Index

• Input: (filename, text) records

• Output: list of files containing each word

• Map: foreach word in text.split(): output(word, filename)

• Combine: uniquify filenames for each word

• Reduce: def reduce(word, filenames): output(word, sort(filenames))
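As a rough illustration of the same logic, here is a minimal sketch using plain Scala collections; the two-document map is made up for the example.

// Hypothetical two-document corpus, for illustration only.
val docs = Map(
  "alice.txt"   -> "The time has come the Walrus said",
  "macbeth.txt" -> "tis time to do it")

// Map: (word, filename) pairs; Reduce: group by word and sort the file lists.
val index = docs.toSeq
  .flatMap { case (file, text) => text.toLowerCase.split(" ").map(w => (w, file)) }
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).distinct.sorted) }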

© 2014 MapR Technologies 18

MapReduce: The Good

• Built in fault tolerance

• Optimized IO path

• Scalable

• Developer focuses on Map/Reduce, not infrastructure

• simple? API

© 2014 MapR Technologies 19

MapReduce: The Bad

• Optimized for disk IO

– Doesn’t leverage memory well

– Iterative algorithms go through disk IO path again and again

• Primitive API

– simple abstraction

– Key/Value in/out

– Basic things like joins require extensive code

• The result is often many files that need to be combined appropriately

© 2014 MapR Technologies 21

What is Hive?

• Data Warehouse on top of Hadoop

– Gives ability to query without programming

– Used for analytical querying of data

• SQL-like execution for Hadoop

• SQL evaluates to MapReduce code

– Submits jobs to your cluster

© 2014 MapR Technologies 22

Using HBase as a MapReduce/Hive Source

EXAMPLE: Data Warehouse for Analytical Processing queries

Diagram: Hive runs a MapReduce application for a SELECT ... JOIN, reading from the HBase database and from files (HDFS/MapR-FS) and writing the query result to a file.

© 2014 MapR Technologies 23

Using HBase as a MapReduce or Hive Sink

EXAMPLE: bulk load data into a table

Diagram: Hive runs a MapReduce application for an INSERT ... SELECT, reading from files (HDFS/MapR-FS) and bulk loading the data into the HBase database.

© 2014 MapR Technologies 24

Using HBase as a Source & Sink

EXAMPLE: calculate and store summaries,

Pre-Computed, Materialized View

Diagram: Hive runs a MapReduce application for a SELECT ... JOIN that reads from the HBase database and stores the computed summaries back into HBase.

© 2014 MapR Technologies 25

Diagram: Hive architecture. Clients (command line interface, web interface, and JDBC/ODBC through the Thrift server) talk to the Driver (compiler, optimizer, executor), which reads schema information from the Metastore and submits jobs to Hadoop (MapReduce + HDFS: JobTracker and NameNode, with DataNode + TaskTracker on each worker). The schema metadata is stored in the Hive metastore; here a Hive table definition maps onto the HBase trades_tall table.

© 2014 MapR Technologies 26

Hive HBase

Diagram: the Hive metastore points to HBase tables, which can be either existing (external) tables or Hive-managed tables.

Hive HBase – External Table

CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")

TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");

Diagram: the Hive table definition "trades" (key string, price bigint, vol bigint) points to the external HBase table /usr/user1/trades_tall, whose rows look like:

key              cf1:price   cf1:vol
AMZN_986186008   12.34       1000
AMZN_986186007   12.00       50

© 2014 MapR Technologies 28

Hive HBase – Hive Query

SQL evaluates to MapReduce code

SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

Diagram: queries against the HBase tables pass through the parser, planner, and execution phases.

© 2014 MapR Technologies 29

Hive HBase – External Table

SQL evaluates to MapReduce code:

SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%";

key              cf1:price   cf1:vol
AMZN_986186008   12.34       1000
AMZN_986186007   12.00       50

Selection: WHERE key LIKE …
Projection: SELECT price
Aggregation: AVG(price)

© 2014 MapR Technologies 30

Hive Query Plan

• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
        TableScan
          Filter Operator
            predicate: (key like 'GOOG%') (type: boolean)
          Select Operator
            Group By Operator
      Reduce Operator Tree:
        Group By Operator
          Select Operator
            File Output Operator

© 2014 MapR Technologies 31

Hive Query Plan – (2)

hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

Diagram: map() tasks scan the trades table, filter rows where key like 'GOOG%', and select price; reduce() tasks group and apply the aggregation avg(price), producing the single output column (col0).

© 2014 MapR Technologies 32

Hive Map Reduce

Diagram: the Hive query (SELECT ... JOIN) runs as Map() tasks that scan the HBase regions (key, row), a shuffle, and reduce() tasks whose results are written to the query result file.

© 2014 MapR Technologies 33

Some Hive Design Patterns

• Summarization

– SELECT min(delay), max(delay), count(*) FROM flights GROUP BY carrier;

• Filtering

– SELECT * FROM trades WHERE key LIKE "GOOG%";

– SELECT * FROM trades ORDER BY price DESC LIMIT 10;

• Join

– SELECT tableA.field1, tableB.field2 FROM tableA JOIN tableB ON tableA.field1 = tableB.field2;

© 2014 MapR Technologies 34

What is a Directed Acyclic Graph (DAG)?

• Graph

– vertices (points) and edges (lines)

• Directed

– Only in a single direction

• Acyclic

– No looping

• This supports fault-tolerance
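For intuition, here is a minimal Spark sketch (assuming a SparkContext named sc) in which each transformation adds an edge to the DAG and nothing runs until an action is called:

val a = sc.parallelize(1 to 10)    // vertex: source RDD
val b = a.map(_ * 2)               // edge a -> b
val c = b.filter(_ % 3 == 0)       // edge b -> c; no job has run yet
c.count()                          // the action triggers execution of the whole DAG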


© 2014 MapR Technologies 35

Hive Query Plan Map Reduce Execution

Diagram: the Hive operator tree (table scans of t1, reduce sinks RS1-RS4, aggregations AGG1 and AGG2, JOIN1, and file sink FS1) is split into MapReduce jobs. Unoptimized, the plan needs three jobs (Job 1, Job 2, Job 3); after optimization the same query runs as a single job.

© 2014 MapR Technologies 36

Iteration: the bane of MapReduce. Slow!

© 2014 MapR Technologies 37

Typical MapReduce Workflows

Diagram: in a chained workflow, the input to Job 1 is read from HDFS; each job runs its maps and reduces and writes its output as a SequenceFile back to HDFS, which becomes the input to the next job, until the last job writes the final output. Every step goes through HDFS.

© 2014 MapR Technologies 38

Iterations

Diagram: a chain of steps, each feeding the next.

In-memory Caching

• Data partitions are read from RAM instead of disk (a minimal sketch follows)
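A minimal sketch (assuming a SparkContext named sc; the input path is a placeholder) of how caching avoids re-reading from disk on every pass:

val nums = sc.textFile("hdfs:///path/to/numbers").map(_.toDouble).cache()  // placeholder path
var total = 0.0
for (i <- 1 to 10) {
  // the first pass reads from disk and caches; later passes read partitions from RAM
  total += nums.reduce(_ + _)
}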

© 2014 MapR Technologies 40

Lab – Query HBase airline data with Hive

Import mapping to row key and columns:

Row key: Carrier-FlightNumber-Date-Origin-Destination

Column families: delay, info, stats, timing

Columns include: aircraftdelay, arrdelay, carrierdelay, cncl, cnclcode, tailnum, distance, elaptime, arrtime, deptime

Sample row: AA-1-2014-01-01-JFK-LAX   13   0   N7704   2475   385.00   359   …

© 2014 MapR Technologies 41

Count number of cancellations by reason (code)

$ hive

hive> explain select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100;

1 row

OK

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

Filter Operator

Select Operator

Group By Operator

aggregations: count()

Reduce Output Operator

Reduce Operator Tree:

Group By Operator

aggregations: count(VALUE._col0)

Select Operator

File Output Operator

Stage: Stage-2

Map Reduce

Map Operator Tree:

TableScan

Reduce Output Operator

Reduce Operator Tree:

Extract

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

Limit

File Output Operator

Stage: Stage-0

Fetch Operator

limit: 100

© 2014 MapR Technologies 42

2 MapReduce jobs

$ hive

hive> select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100;

1 row

Total jobs = 2

MapReduce Jobs Launched:

Job 0: Map: 1  Reduce: 1  Cumulative CPU: 13.3 sec  MAPRFS Read: 0  MAPRFS Write: 0  SUCCESS

Job 1: Map: 1  Reduce: 1  Cumulative CPU: 1.52 sec  MAPRFS Read: 0  MAPRFS Write: 0  SUCCESS

Total MapReduce CPU Time Spent: 14 seconds 820 msec

OK

4598 C

7146 A

© 2014 MapR Technologies 43

Find the longest airline delays

$ hive

hive> select arrdelay, key from flighttable where arrdelay > 1000 order by arrdelay desc limit 10;

1 row

MapReduce Jobs Launched:

Map: 1 Reduce: 1

OK

1530.0 AA-385-2014-01-18-BNA-DFW

1504.0 AA-1202-2014-01-15-ONT-DFW

1473.0 AA-1265-2014-01-05-CMH-LAX

1448.0 AA-1243-2014-01-21-IAD-DFW

1390.0 AA-1198-2014-01-11-PSP-DFW

1335.0 AA-1680-2014-01-21-SLC-DFW

1296.0 AA-1277-2014-01-21-BWI-DFW

1294.0 MQ-2894-2014-01-02-CVG-DFW

1201.0 MQ-3756-2014-01-01-CLT-MIA

1184.0 DL-2478-2014-01-10-BOS-ATL

© 2014 MapR Technologies 44 © 2014 MapR Technologies

Apache Spark

© 2014 MapR Technologies 45

Apache Spark

spark.apache.org

github.com/apache/spark

user@spark.apache.org

• Originally developed in 2009 in UC Berkeley's AMP Lab

• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation

© 2014 MapR Technologies 46

Spark: Fast Big Data

– Rich APIs in Java, Scala, Python

– Interactive shell

• Fast to Run

– General execution graphs

– In-memory storage

2-5× less code

© 2014 MapR Technologies 47

The Spark Community

© 2014 MapR Technologies 48

Spark is the Most Active Open Source Project in Big Data

Chart: project contributors in the past year (0-140 scale); Spark has far more contributors than Giraph, Storm, or Tez.

© 2014 MapR Technologies 49

Unified Platform

Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine, which can be deployed on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …).

© 2014 MapR Technologies 50

Spark Use Cases

• Iterative algorithms on large amounts of data

• Anomaly detection

• Classification

• Predictions

• Recommendations

© 2014 MapR Technologies 51

Why Iterative Algorithms

• Algorithms that need iterations

– Clustering (K-Means, Canopy, …)

– Gradient descent (e.g., Logistic Regression, Matrix Factorization)

– Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality, …)

– Alternating Least Squares (ALS)

– Graph communities / dense sub-components

– Inference (belief propagation)

– …

© 2014 MapR Technologies 52

Example: Logistic Regression

• Goal: find best line separating two sets of points

Diagram: a random initial line is adjusted over iterations toward the target line separating the two sets of points.

© 2014 MapR Technologies 53

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w

Iteration!

Logistic Regression

© 2014 MapR Technologies 54

Data Sources

• Local Files

– file:///opt/httpd/logs/access_log

• S3

• Hadoop Distributed Filesystem

– Regular files, sequence files, any other Hadoop InputFormat

• HBase

• other NoSQL data stores
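A minimal sketch (paths and bucket names are placeholders) of pointing sc.textFile at different sources:

val local = sc.textFile("file:///opt/httpd/logs/access_log")   // local file
val hdfs  = sc.textFile("hdfs:///user/me/input")               // HDFS, placeholder path
val s3    = sc.textFile("s3n://my-bucket/logs/*.gz")           // S3, assuming credentials are configured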

© 2014 MapR Technologies 55 © 2014 MapR Technologies

How Spark Works

© 2014 MapR Technologies 56

Spark Programming Model

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map

Diagram: the Driver Program creates a SparkContext, which connects to the cluster; tasks run on the Worker Nodes.
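Fleshing that pseudocode out, a minimal sketch of a self-contained driver program (the app name and input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object WordLengths {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordLengths")
    val sc   = new SparkContext(conf)                 // the driver's connection to the cluster
    val rdd  = sc.textFile("hdfs:///user/me/input")   // placeholder path
    println(rdd.map(_.length).reduce(_ + _))          // tasks run on worker-node executors
    sc.stop()
  }
}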

© 2014 MapR Technologies 57

Resilient Distributed Datasets (RDD)

Spark revolves around RDDs

• Fault-tolerant

• Read-only collection of elements

• Operated on in parallel

• Cached in memory or on disk

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

© 2014 MapR Technologies 58

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

© 2014 MapR Technologies 59

Working With RDDs

RDD RDD RDD RDD

Transformations

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

textFile = sc.textFile("SomeFile.txt")

© 2014 MapR Technologies 60

Working With RDDs

RDD RDD RDD RDD

Transformations

Action Value

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()

74

linesWithSpark.first()

# Apache Spark

textFile = sc.textFile("SomeFile.txt")

© 2014 MapR Technologies 61

MapR Tutorial: Getting Started with Spark on MapR Sandbox

• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial

© 2014 MapR Technologies 62

Example Spark Word Count in Java

Diagram: the input text from the Walrus poem is split into words; each word becomes a (word, 1) pair such as (the, 1), (time, 1), (and, 1), and reducing by key produces counts such as (and, 12), (time, 4), …, (the, 20).

JavaRDD<String> input = sc.textFile(inputFile);

// Split each line into words

JavaRDD<String> words = input.flatMap(

new FlatMapFunction<String, String>() {

public Iterable<String> call(String x) {

return Arrays.asList(x.split(" "));

}});

// Turn the words into (word, 1) pairs

JavaPairRDD<String, Integer> word1s = words.mapToPair(

new PairFunction<String, String, Integer>(){

public Tuple2<String, Integer> call(String x){

return new Tuple2(x, 1);

}});

// Reduce the pairs by key, adding the counts

JavaPairRDD<String, Integer> counts =word1s.reduceByKey(

new Function2<Integer, Integer, Integer>(){

public Integer call(Integer x, Integer y){

return x + y;

}});

. . . . . . . . .

© 2014 MapR Technologies 63

Example Spark Word Count in Scala

Diagram: the same word-count flow as above: input text, (word, 1) pairs, and reduced counts such as (and, 12), (time, 4), …, (the, 20).

// Load our input data.

val input = sc.textFile(inputFile)

// Split it up into words.

val words = input.flatMap(line => line.split(" "))

// Transform into pairs and count.

val counts = words

.map(word => (word, 1))

.reduceByKey{case (x, y) => x + y}

// Save the word count back out to a text file,

counts.saveAsTextFile(outputFile)

the, 20 time, 4 ….. and, 12

. . . . . . . . .

© 2014 MapR Technologies 64

Example Spark Word Count in Scala

// Load input data.
val input = sc.textFile(inputFile)

Diagram: textFile creates a HadoopRDD (with its partitions), which is mapped to a MapPartitionsRDD.

© 2014 MapR Technologies 65

Example Spark Word Count in Scala


// Load our input data.

val input = sc.textFile(inputFile)

// Split it up into words.

val words = input.flatMap(line => line.split(" "))

Diagram: the lineage is now HadoopRDD → MapPartitionsRDD (textFile) → MapPartitionsRDD (flatMap).

© 2014 MapR Technologies 66

FlatMap

flatMap(line => line.split(" ")) is a 1-to-many mapping: a line such as "Ships and wax" becomes the separate elements "Ships", "and", "wax".

The result is the words RDD (JavaRDD<String> in the Java version).

© 2014 MapR Technologies 67

Example Spark Word Count in Scala


textFile flatmap map

val input = sc.textFile(inputFile)

val words = input.flatMap(line => line.split(" "))

// Transform into pairs

val counts = words.map(word => (word, 1))

Diagram: the lineage is now HadoopRDD → MapPartitionsRDD (textFile) → MapPartitionsRDD (flatMap) → MapPartitionsRDD (map).

© 2014 MapR Technologies 68

Map

map(word => (word, 1)) is a 1-to-1 mapping: "and" becomes ("and", 1).

The result is the word1s RDD (JavaPairRDD<String, Integer> in the Java version).

© 2014 MapR Technologies 69

Example Spark Word Count in Scala


textFile flatmap map reduceByKey

val input = sc.textFile(inputFile)

val words = input.flatMap(line => line.split(" "))

val counts = words

.map(word => (word, 1))

.reduceByKey{case (x, y) => x + y}

Diagram: the lineage is now HadoopRDD → MapPartitionsRDD (textFile) → MapPartitionsRDD (flatMap) → MapPartitionsRDD (map) → ShuffledRDD (reduceByKey).

© 2014 MapR Technologies 70

reduceByKey

reduceByKey { case (x, y) => x + y } merges the values for each key: ("and", 1), ("and", 1), ("wax", 1) become ("and", 2) and ("wax", 1).

The result is the counts RDD (JavaPairRDD<String, Integer> in the Java version).

© 2014 MapR Technologies 71

Example Spark Word Count in Scala

textFile flatmap map reduceByKey

val input = sc.textFile(inputFile)

val words = input.flatMap(line => line.split(" "))

val counts = words

.map(word => (word, 1))

.reduceByKey{case (x, y) => x + y}

val countArray = counts.collect()

Diagram: the full lineage is HadoopRDD → MapPartitionsRDD (textFile) → MapPartitionsRDD (flatMap) → MapPartitionsRDD (map) → ShuffledRDD (reduceByKey); collect returns the result to the driver as an Array.

© 2014 MapR Technologies 72 © 2014 MapR Technologies

Components Of Execution

© 2014 MapR Technologies 73

MapR Blog: Getting Started with the Spark Web UI

• https://www.mapr.com/blog/getting-started-spark-web-ui

© 2014 MapR Technologies 74

Spark RDD DAG -> Physical Execution plan

RDD graph: sc.textfile(…) creates a HadoopRDD and a MapPartitionsRDD; flatMap and map each add a MapPartitionsRDD; reduceByKey adds a ShuffledRDD; collect triggers execution.

Physical plan: the RDD graph is split into Stage 1 (everything before the shuffle) and Stage 2 (after the shuffle).

© 2014 MapR Technologies 75

Physical Execution Plan -> Stages and Tasks

Diagram: the DAG's physical plan is split into Stage 1 and Stage 2, and each stage is split into tasks; a TaskSet is handed to the Task Scheduler. On each worker node an Executor runs tasks in task threads against cached partitions built from the HDFS blocks (HFiles) stored on the local Data Node.

© 2014 MapR Technologies 76

Summary of Components

• Task : unit of execution

• Stage: Group of Tasks

– Based on partitions of the RDD

– Tasks run in parallel

• DAG : Logical Graph of RDD operations

• RDD : Parallel dataset with partitions


© 2014 MapR Technologies 77

How Spark Application runs on a Hadoop cluster

Diagram: on the client node, the driver program creates a SparkContext (sc = new SparkContext; rDD = sc.textfile("hdfs://…"); rDD.map). The SparkContext requests resources from the YARN Resource Manager (coordinated with ZooKeeper), and the YARN Node Managers launch Executors on the worker nodes. Each Executor runs tasks against cached partitions built from the HDFS blocks (HFiles) stored on the local Data Node.

© 2014 MapR Technologies 78

Deploying Spark – Cluster Manager Types

• Standalone mode

• Mesos

• YARN

• EC2

• GCE
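As a rough sketch, the cluster manager is usually selected through the master URL when the SparkContext is created; the URLs below are standard example forms, not a specific cluster:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("demo")
  .setMaster("local[4]")   // or "spark://host:7077" (standalone), "mesos://host:5050", "yarn-client"
val sc = new SparkContext(conf)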

© 2014 MapR Technologies 79 © 2014 MapR Technologies

Example: Log Mining

© 2014 MapR Technologies 80

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Based on slides from Pat McDonough at

© 2014 MapR Technologies 81

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

© 2014 MapR Technologies 82

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

© 2014 MapR Technologies 83

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

Base RDD

© 2014 MapR Technologies 84

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

© 2014 MapR Technologies 85

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

Transformed RDD

© 2014 MapR Technologies 86

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

© 2014 MapR Technologies 87

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count() Action

© 2014 MapR Technologies 88

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

© 2014 MapR Technologies 89

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver tasks

tasks

tasks

© 2014 MapR Technologies 90

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Read

HDFS

Block

Read

HDFS

Block

Read

HDFS

Block

© 2014 MapR Technologies 91

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

Process

& Cache

Data

Process

& Cache

Data

Process

& Cache

Data

© 2014 MapR Technologies 92

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

results

results

results

© 2014 MapR Technologies 93

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

© 2014 MapR Technologies 94

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

tasks

tasks

tasks

Driver

© 2014 MapR Technologies 95

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

Process

from

Cache

Process

from

Cache

Process

from

Cache

© 2014 MapR Technologies 96

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

results

results

results

© 2014 MapR Technologies 97

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

Cache your data -> Faster Results

Full-text search of Wikipedia:

• 60 GB on 20 EC2 machines

• 0.5 sec from cache vs. 20 sec from disk

© 2014 MapR Technologies 98 © 2014 MapR Technologies

Transformations and Actions

© 2014 MapR Technologies 99

RDD Transformations and Actions

Diagram: a chain of RDDs connected by transformations, ending with an action that returns a value.

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a value): reduce, collect, count, save, lookupKey, …

© 2014 MapR Technologies 100

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # {0, 0, 1, 0, 1, 2}

range(x) is a sequence of numbers 0, 1, …, x-1

© 2014 MapR Technologies 101

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)   # => [1, 2]

# Count number of elements
> nums.count()   # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

© 2014 MapR Technologies 102

RDD Fault Recovery

• RDDs track lineage information

• Lineage can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

Diagram: lineage: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD
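A minimal sketch (assuming a SparkContext named sc; the log path is a placeholder): toDebugString prints the lineage Spark would use to recompute lost partitions.

val msgs = sc.textFile("hdfs:///logs/app.log")   // placeholder path
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))
println(msgs.toDebugString)                      // shows the chain of parent RDDs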

© 2014 MapR Technologies 103

Passing a function to Spark

• Spark makes heavy use of anonymous function syntax

– (x: Int) => x *x

• Which is a shorthand for

new Function1[Int,Int] {

def apply(x: Int) = x * x

}
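Either form can be passed straight to an RDD operation; a named method works as well (a small sketch, assuming a SparkContext named sc):

val nums    = sc.parallelize(1 to 5)
val squares = nums.map(x => x * x)      // anonymous function
def cube(n: Int): Int = n * n * n
val cubes   = nums.map(cube)            // named function passed by reference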


© 2014 MapR Technologies 104 © 2014 MapR Technologies

Dataframes

© 2014 MapR Technologies 105

DataFrame

A DataFrame is a distributed collection of data organized into named columns.

// Create the DataFrame

val df = sqlContext.read.json("person.json")

// Print the schema in a tree format

df.printSchema()

// root

// |-- age: long (nullable = true)

// |-- name: string (nullable = true)

// |-- height: string (nullable = true)

// Select only the "name" column

df.select("name").show()

https://spark.apache.org/docs/latest/sql-programming-guide.html

© 2014 MapR Technologies 106

DataFrame RDD

• DataFrame style:

lineitems.groupBy("customer").agg(Map(
  "units" -> "avg",
  "totalPrice" -> "std"
))

• or SQL style:

SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
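To run the SQL form against a DataFrame in the Spark 1.x API shown here, the DataFrame is first registered as a temporary table; this is a minimal sketch whose table and column names simply mirror the example above.

lineitems.registerTempTable("lineitems")                              // make the DataFrame queryable by name
val byCustomer = sqlContext.sql(
  "SELECT customer, AVG(units) FROM lineitems GROUP BY customer")
byCustomer.show()                                                      // print the aggregated result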

© 2014 MapR Technologies 107

Demo Interactive Shell

• Iterative Development

– Cache those RDDs

– Open the shell and ask questions

• We have all wished we could do this with MapReduce

– Compile / save your code for scheduled jobs later

• Scala – spark-shell

• Python – pyspark
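A typical spark-shell session might look like the sketch below (the shell provides sc; the log path is a placeholder):

$ spark-shell
scala> val logs = sc.textFile("hdfs:///logs/access_log").cache()
scala> logs.filter(_.contains(" 404 ")).count()
scala> logs.filter(_.contains(" 500 ")).count()   // the follow-up question reuses the cached data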

© 2014 MapR Technologies 108

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data

• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data

© 2014 MapR Technologies 109

The physical plan for DataFrames

© 2014 MapR Technologies 110

DataFrame Execution Plan

// Print the physical plan to the console
auction.select("auctionid").distinct.explain()

== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [auctionid#0], 200)
  Distinct true
   Project [auctionid#0]
    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37

© 2014 MapR Technologies 111 © 2014 MapR Technologies

There’s a lot more !

© 2014 MapR Technologies 112

Unified Platform

Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine, which can be deployed on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …).

© 2014 MapR Technologies 113

Soon to Come

• Spark On Demand Training – https://www.mapr.com/services/mapr-academy/

• Blogs and Tutorials: – Movie Recommendations with Collaborative Filtering

– Spark Streaming

© 2014 MapR Technologies 114

Soon to Come

Blogs and Tutorials: – Rewrite this Mahout example with Spark

© 2014 MapR Technologies 115 © 2014 MapR Technologies

Examples and Resources

© 2014 MapR Technologies 116

Spark on MapR

• Certified Spark Distribution

• Fully supported and packaged by MapR in partnership with

Databricks

– mapr-spark package with Spark, Shark, Spark Streaming today

– Spark-python, GraphX and MLlib soon

• YARN integration

– Spark can then allocate resources from cluster when needed

© 2014 MapR Technologies 118

Q & A

@mapr maprtech

kbotzum@mapr.com

Engage with us!

MapR

maprtech

mapr-technologies
