
Page 1: Other Map-Reduce (ish) Frameworks: Spark

William Cohen

Page 2: Recap: Last month

• More concise languages for map-reduce pipelines
• Abstractions built on top of map-reduce
  – General comments
  – Specific systems
    • Cascading, Pipes
    • PIG, Hive
    • Spark, Flink

Page 3: Recap: Issues with Hadoop

• Too much typing – programs are not concise
• Too low level – missing abstractions – hard to specify a workflow
• Not well suited to iterative operations
  – E.g., E/M, k-means clustering, …
  – Workflow and memory-loading issues

Page 4: Spark

Spark’s answers to the issues above:

• Set of concise dataflow operations (“transformations”)
• Dataflow operations are embedded in an API together with “actions”
• Sharded files are replaced by “RDDs” – resilient distributed datasets
• RDDs can be cached in cluster memory and recreated to recover from errors

Page 5: Spark examples

In these examples, spark is a SparkContext object.
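The code on this slide was an image and did not survive transcription. Below is a minimal sketch of the classic log-mining example that the annotations on the next page describe; the HDFS path and the "timeout" filter are placeholders, not the slide's actual code.

lines = spark.textFile("hdfs://...")                    # RDD of log lines, sharded across workers
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformation: builds a plan, computes nothing yet
print(errors.count())                                   # action: runs the plan and returns a number

timeouts = errors.filter(lambda s: "timeout" in s)      # another transformation
print(timeouts.collect())                               # action: ships the matching lines back to the driver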

Page 6: Spark examples

errors is a transformation, and thus a data structure that explains HOW to do something.

count() is an action: it will actually execute the plan for errors and return a value.

errors.filter() is a transformation; collect() is an action.

Everything is sharded, like in Hadoop and GuineaPig.

Page 7: Spark examples

# modify errors to be stored in cluster memory

Subsequent actions will be much faster.

Everything is sharded … and the shards are stored in the memory of worker machines, not local disk (if possible).

You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.
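A minimal sketch of the caching described above, continuing the log-mining example (cache() and persist() are the standard PySpark calls):

errors.cache()    # mark errors for in-memory storage on the workers
errors.count()    # first action computes the shards and caches them
errors.count()    # later actions reuse the cached shards: much faster

# Alternative: persist to disk instead, roughly like opts(stored=True) in GuineaPig:
#   from pyspark import StorageLevel
#   errors.persist(StorageLevel.DISK_ONLY)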

Page 8: Spark examples: wordcount

Two callouts on the wordcount code: one marks the action; the other marks a transformation on (key,value) pairs, which are special.
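The wordcount code itself is an image in the original. A minimal sketch of what the callouts plausibly point at, with a placeholder input path:

counts = (spark.textFile("hdfs://...")
          .flatMap(lambda line: line.split())   # one record per word
          .map(lambda w: (w, 1))                # build (key, value) pairs
          .reduceByKey(lambda a, b: a + b))     # transformation on (key,value) pairs
print(counts.collect())                         # the action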

Page 9: Spark examples: batch logistic regression

reduce is an action – it produces a numpy vector.

p.x and w are vectors, from the numpy package. Python overloads operations like * and + for vectors.
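The code for pages 9–12 is an image in the original. A minimal sketch of the batch logistic regression loop those annotations describe; Point, parse_point, the input path, D, and ITERATIONS are assumptions, and each point p has a numpy feature vector p.x and a label p.y in {-1, +1}:

import numpy as np
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])   # x: numpy feature vector, y: label in {-1, +1}

def parse_point(line):
    # assumed input format: label followed by D feature values per line
    nums = [float(t) for t in line.split()]
    return Point(x=np.array(nums[1:]), y=nums[0])

D = 10           # feature dimension (assumed)
ITERATIONS = 20  # number of epochs (assumed)

points = spark.textFile("hdfs://...").map(parse_point).cache()  # cached in cluster memory (page 12)
w = np.random.rand(D)   # initial weights, living on the driver

for i in range(ITERATIONS):
    # map is a transformation; reduce is an action returning a numpy vector.
    # w is captured in the lambda's closure and shipped, read-only, to each worker.
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p.y * np.dot(w, p.x))) - 1.0) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient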

Page 10: Spark examples: batch logistic regression

Important note: numpy vectors/matrices are not just “syntactic sugar”.
• They are much more compact than something like a list of python floats.
• numpy operations like dot, *, + are calls to optimized C code.
• A little python logic around a lot of numpy calls is pretty efficient.

Page 11: Spark examples: batch logistic regression

w is defined outside the lambda function, but used inside it.

So: python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.

Page 12: Spark examples: batch logistic regression

The dataset of points is cached in cluster memory to reduce i/o.

Page 13: Spark logistic regression example

Page 14: Spark

Page 15: Spark details: broadcast

Recall: python builds a closure – code including the current value of w – and Spark ships it off to each worker. So w is copied, and must be read-only.

Page 16: Spark details: broadcast

Alternative: create a broadcast variable, e.g.,
• w_broad = spark.broadcast(w)
which is accessed by the worker via
• w_broad.value

What’s sent is a small pointer to w (e.g., the name of a file containing a serialized version of w), and when value is accessed, some clever all-reduce-like machinery is used to reduce network load.

There is little penalty for distributing something that’s not used by all workers.
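A minimal sketch of the gradient step from page 9 rewritten with a broadcast variable, reusing the names (points, w, np) from that earlier sketch:

w_broad = spark.broadcast(w)            # ships a small handle, not w itself
gradient = points.map(
    lambda p: (1.0 / (1.0 + np.exp(-p.y * np.dot(w_broad.value, p.x))) - 1.0) * p.y * p.x
).reduce(lambda a, b: a + b)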

Page 17: Spark details: mapPartitions

Common issue:
• a map task requires loading in some small shared value
• more generally, a map task requires some sort of initialization before processing a shard

• GuineaPig:
  – special Augment … sideview … pattern for shared values
  – can kludge up any initializer using Augment
• Raw Hadoop: mapper.configure() and mapper.close() methods

Page 18: Spark details: mapPartitions

Spark:
• rdd.mapPartitions(f): will call f(iteratorOverShard) once per shard, and return an iterator over the mapped values.
• f() can do any setup/close steps it needs (see the sketch below).

Also:
• there are transformations to partition an RDD with a user-selected function, like in Hadoop. Usually you partition and persist/cache.
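A minimal sketch of per-shard setup/teardown with mapPartitions; open_connection and transform are hypothetical stand-ins for whatever initialization the map task needs:

def process_shard(shard_iterator):
    db = open_connection()            # setup: once per shard, not once per record
    for record in shard_iterator:
        yield transform(db, record)   # map each record in the shard
    db.close()                        # teardown after the shard is exhausted

mapped = rdd.mapPartitions(process_shard)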

Page 19: Spark – from logistic regression to matrix factorization

William Cohen

Page 20: Recovering latent factors in a matrix (Recap)

Approximate the rating matrix as a product of two low-rank factor matrices: V ≈ W H, where

• V is n users × m movies, with V[i,j] = user i’s rating of movie j
• W is n × r, one row of r latent factors per user (the rows x1 y1, x2 y2, …, xn yn in the slide’s r = 2 picture)
• H is r × m, one column of r latent factors per movie (the rows a1 … am and b1 … bm in the picture)

Page 21: Recap

Page 22: Recap (figure: strata)

Page 23: MF HW

How do you define the strata? First assign rows and columns to blocks, then assign blocks to strata.

With K = 5 blocks per side, the stratum index of each block is:

1 2 3 4 5
5 1 2 3 4
4 5 1 2 3
3 4 5 1 2
2 3 4 5 1

• colBlock = rowBlock (the diagonal stratum)
• colBlock – rowBlock = 1 mod K (the next stratum)
• in general, stratum i is defined by colBlock – rowBlock = i mod K (see the sketch below)
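A minimal sketch of that rule; the function name and the 0-based numbering are assumptions (the slide numbers blocks and strata from 1):

def stratum(rowBlock, colBlock, K):
    # stratum index of block (rowBlock, colBlock), 0-based
    return (colBlock - rowBlock) % K

With K = 5 this reproduces the grid above, up to the shift between 0- and 1-based numbering.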

Page 24: MF HW

Their algorithm:
• for epoch t = 1, …, T in sequence
  – for stratum s = 1, …, K in sequence
    • for block b = 1, … in stratum s, in parallel
      – for triple (i, j, rating) in block b, in sequence
        » do SGD step for (i, j, rating)

Page 25: MF HW

Our algorithm (sketched below):
• cache the rating matrix into cluster memory (like the logistic regression training data)
• for epoch t = 1, …, T in sequence (like the outer loop for logistic regression)
  – for stratum s = 1, …, K in sequence
    • distribute H, W to workers (broadcast numpy matrices)
    • for block b = 1, … in stratum s, in parallel
      – run SGD and collect the updates that are performed
        » i.e., the deltas (i,j,deltaH) or (i,j,deltaW)
      – aggregate the deltas and apply the updates to H, W (sort of like reduce in logistic regression)
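A minimal sketch of that loop in PySpark, under stated assumptions: H and W are the numpy factor matrices on the driver, and parse_rating, in_stratum (the blocking rule from page 23), sgd_on_block (runs SGD over one block’s triples and returns its (i,j,delta) updates), and apply_deltas are hypothetical helpers:

T, K = 10, 5   # epochs and strata (assumed values)

ratings = spark.textFile("hdfs://...").map(parse_rating).cache()  # like the LR training data

for t in range(T):                                 # like the outer loop for logistic regression
    for s in range(K):                             # strata in sequence
        H_b = spark.broadcast(H)                   # broadcast numpy matrices
        W_b = spark.broadcast(W)
        deltas = (ratings
                  .filter(lambda r: in_stratum(r, s, K))   # keep this stratum's blocks
                  .mapPartitions(lambda block:             # blocks processed in parallel
                                 sgd_on_block(block, H_b.value, W_b.value))
                  .collect())                              # gather the (i,j,delta) updates
        H, W = apply_deltas(H, W, deltas)          # apply the updates on the driver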