Cluster Computing and Datalog Recursion Via Map-Reduce: Seminaïve Evaluation; Re-engineering Map-Reduce for Recursion

Upload: emery-emory-henderson

Post on 02-Jan-2016


TRANSCRIPT

Page 1

Cluster Computing and Datalog Recursion Via Map-Reduce

Seminaïve Evaluation

Re-engineering Map-Reduce for Recursion

Page 2

Acknowledgements

Joint work with Foto Afrati. Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.

Page 3

Implementing Datalog via Map-Reduce

Joins are straightforward to implement as a round of map-reduce.

Likewise, union/duplicate-elimination is a round of map-reduce.

But a recursion iterates such joins and unions, so its implementation can take many rounds of map-reduce.
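For concreteness, here is a minimal sequential sketch of one such round, a natural join, in Python (the code and names are illustrative, not part of the original slides): the map phase keys each tuple by the join attribute, and the reduce phase pairs up the matching tuples under each key.

```python
from collections import defaultdict

def mapreduce_join(r, s):
    """One map-reduce round computing q(A,C) :- r(A,B) & s(B,C).

    Map: key each tuple by its join attribute B.
    Reduce: for each key, combine every r-tuple with every s-tuple.
    """
    buckets = defaultdict(lambda: ([], []))     # B-value -> (A's from r, C's from s)
    for a, b in r:                              # "map" over r
        buckets[b][0].append(a)
    for b, c in s:                              # "map" over s
        buckets[b][1].append(c)
    out = set()                                 # set semantics: duplicates eliminated
    for _, (lefts, rights) in buckets.items():  # "reduce" per key
        for a in lefts:
            for c in rights:
                out.add((a, c))
    return out
```

Union with duplicate-elimination has the same shape: an identity map keyed by the whole tuple, and a reduce that keeps one copy per key.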

Page 4

Seminaïve Evaluation

Specific combination of joins and unions.

Example: chain rule q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

Let r, s, t = “old” relations; r’, s’, t’ = incremental relations.

Simplification: assume |r’| = a|r|, etc.

Page 5

A 3-Way Join Using Map-Reduce

q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z). Use k compute nodes. Give X and Y shares x and y, with xy = k, to determine the reduce task that gets each tuple.

The optimum strategy replicates r and t, not s, using communication |s| + 2√(k|r||t|).
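The arithmetic behind that claim can be sketched as follows (illustrative Python, not from the slides): each s-tuple knows both reduce keys and goes to one reducer, each r-tuple must be replicated across the y buckets of Y, and each t-tuple across the x buckets of X; minimizing y|r| + x|t| subject to xy = k balances the two terms.

```python
from math import sqrt

def optimal_shares(k, r_size, s_size, t_size):
    """Shares for q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) on k reduce
    tasks, keyed by (bucket of X, bucket of Y).

    s(X,Y) knows both keys: one reducer per tuple.
    r(W,X) knows only X: replicated to all y buckets of Y.
    t(Y,Z) knows only Y: replicated to all x buckets of X.
    """
    x = sqrt(k * r_size / t_size)             # share for X
    y = sqrt(k * t_size / r_size)             # share for Y; note x*y = k
    cost = s_size + y * r_size + x * t_size   # = |s| + 2*sqrt(k*|r|*|t|)
    return x, y, cost
```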

Page 6

Seminaïve Evaluation – (2)

Need to compute the sum (union) of seven terms (joins): rst' + rs't + r'st + rs't' + r'st' + r's't + r's't'.

Obvious method for computing a round of seminaïve evaluation: replicate r and r'; replicate t and t'; do not replicate s or s'. Communication = (1+a)(|s| + 2√(k|r||t|)).
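The seven terms are exactly the expansion of (r+r')(s+s')(t+t') with the all-old term rst removed; a quick Python check of that identity (illustrative code, not from the slides):

```python
def chain_join(r, s, t):
    """q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) over sets of pairs."""
    return {(w, z)
            for (w, x1) in r
            for (x2, y1) in s if x1 == x2
            for (y2, z) in t if y1 == y2}

def seminaive_round(r, dr, s, ds, t, dt):
    """Union of the seven delta joins: every old/new combination of
    the three relations except the all-old term r s t."""
    out = set()
    for a in (r, dr):
        for b in (s, ds):
            for c in (t, dt):
                if a is r and b is s and c is t:
                    continue              # rst was produced in earlier rounds
                out |= chain_join(a, b, c)
    return out
```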

Page 7

Seminaïve Evaluation – (3)

There are many other ways we might use k nodes to do the same task.

Example: one group of nodes does (r+r’)s’(t+t’); a second group does r’s(t+t’); the third group does rst’.

Theorem: no grouping does better than the obvious method for this example.

Page 8

Networks of Processes for Recursions

Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost?

Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.

Page 9

Example: Very Simple Recursion

p(X,Y) :- e(X,Z) & p(Z,Y)
p(X,Y) :- p0(X,Y)

Use k compute nodes. Hash Y-values to one of k buckets h(Y). Each node gets a complete copy of e. p0 is distributed among the k nodes, with p0(x,y) going to node h(y).

Page 10

Example – Continued

p(X,Y) :- e(X,Z) & p(Z,Y)

Each node applies the recursive rule and generates new tuples p(x,y). Key point: since each new tuple has a Y-value that hashes to the same node, no communication is necessary.

Duplicates are eliminated locally.
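A sequential simulation of the scheme makes the key point concrete (a sketch, not from the slides; the function name and Python's built-in hash stand in for real tasks and h): every derived tuple stays on the node that already owns its Y-bucket, so each node's loop runs with no messages at all.

```python
def reachability(e, p0, k):
    """p(X,Y) :- e(X,Z) & p(Z,Y);  p(X,Y) :- p0(X,Y),
    evaluated on k simulated nodes.

    Every node holds all of e; p0(x,y) is placed on node h(y).
    A derived tuple p(x,y) inherits the Y-value of the p-tuple it
    came from, hence lands on the same node: no communication.
    """
    h = lambda v: hash(v) % k
    nodes = [set() for _ in range(k)]
    for x, y in p0:
        nodes[h(y)].add((x, y))
    for node in nodes:                    # each node runs independently
        frontier = set(node)
        while frontier:                   # local seminaive iteration
            new = {(x, y)
                   for (x, z) in e
                   for (z2, y) in frontier if z == z2} - node
            node |= new                   # local duplicate elimination
            frontier = new
    return set().union(*nodes)
```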

Page 11

Harder Case of Recursion

Consider the recursive rule p(X,Y) :- p(X,Z) & p(Z,Y).

Responsibility is divided among compute nodes by hashing Z-values.

Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.

Page 12

Compute Node for h(Z) = n

The node for h(Z) = n receives every tuple p(a,b) with h(a) = n or h(b) = n. It remembers all received tuples (eliminating duplicates) and searches them for matches. Each tuple p(c,d) it produces is sent to the nodes for h(c) and h(d).
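A sketch of such a node as a Python class (the class and method names are my own, not from the slides): the set of remembered tuples serves both as the join state and as the duplicate filter, which is exactly what makes restarts after failures harmless later on.

```python
class JoinNode:
    """Node n of k for p(X,Y) :- p(X,Z) & p(Z,Y), hashed on Z."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.seen = set()                 # all tuples ever received

    def h(self, v):
        return hash(v) % self.k

    def receive(self, a, b):
        """Handle incoming p(a,b); return the set of tuples produced.

        Each produced p(c,d) would then be sent to nodes h(c) and h(d)."""
        if (a, b) in self.seen:
            return set()                  # duplicate: drop silently
        self.seen.add((a, b))
        out = set()
        if self.h(b) == self.n:           # (a,b) is a left operand: Z = b
            out |= {(a, d) for (b2, d) in self.seen if b2 == b}
        if self.h(a) == self.n:           # (a,b) is a right operand: Z = a
            out |= {(c, b) for (c, a2) in self.seen if a2 == a}
        return out
```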

Page 13

Comparison with Iteration

Advantage: Lets us avoid some communication of data that would be needed in iterated map-reduce rounds.

Disadvantage: Tasks run longer, so they are more likely to fail.

Page 14

Node Failures

To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes.

But recursions can’t work that way. What happens if a node fails after some of its output has been consumed?

Page 15

Node Failures – (2)

Actually, there is no problem! We restart the tasks of the failed node at another node.

The replacement task will send some data that the failed task also sent. But each node remembers tuples to eliminate duplicates anyway, so the repeated data is harmless.

Page 16

Node Failures – (3)

But the “no problem” conclusion is highly dependent on the Datalog assumption that we are computing sets.

The argument would fail if we were computing bags or aggregations of the tuples produced.

Similar problems arise for other recursions, e.g., PDEs.

Page 17

Extension of Map-Reduce Architecture for Recursion

Necessarily, all tasks need to operate in rounds.

The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.

Page 18

Extension – (2)

Suppose some task S fails, and it never supplies the round-(i+1) input to T.

A replacement S’ for S is restarted at some other node.

The master knows that T has received up to round i from S, so it ignores the first i output files from S’.

Page 19

Extension – (3)

The master knows where all the inputs ever received by S came from, so it can provide those inputs to S’.

Page 20

Checkpointing and State

Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere.

Tasks take input + state. Initially, state is empty.

Master can restart a task from some state and feed it only inputs received after that state was written.
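A toy model of that protocol (illustrative; a real system would replicate the snapshot to other nodes and the master would track which inputs postdate it): the state is the set of tuples received, a snapshot captures it, and a restart is the snapshot plus a replay of the inputs that arrived afterward.

```python
class CheckpointedTask:
    """Task whose state is the set of tuples received so far."""

    def __init__(self):
        self.state = set()

    def feed(self, batch):
        """Consume one input file (a batch of tuples)."""
        self.state |= set(batch)

    def snapshot(self):
        """State file the master would replicate elsewhere."""
        return set(self.state)

    @classmethod
    def restart(cls, snapshot, later_inputs):
        """Rebuild the task from a snapshot plus the inputs the
        master recorded as delivered after that snapshot."""
        task = cls()
        task.state = set(snapshot)
        for batch in later_inputs:
            task.feed(batch)
        return task
```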

Page 21

Example: Checkpointing

p(X,Y) :- p(X,Z) & p(Z,Y). Two groups of tasks:

1. Join tasks: hash on Z, using h(Z). Like tasks from previous example.

2. Eliminate-duplicates tasks: hash on X and Y, using h’(X,Y). Receives tuples from join tasks. Distributes truly new tuples to join tasks.

Page 22

Example – (2)

Join tasks: state has p(x,y) if h(x) or h(y) is right. Dup-elim tasks: state has p(x,y) if h’(x,y) is right.

A join task sends each produced tuple p(a,b) to the dup-elim task for h’(a,b); if the tuple is new, that task forwards it to the join tasks for h(a) and h(b).
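The whole two-rank network can be simulated sequentially with one message queue (a sketch under the slides' setup; Python's built-in hash stands in for both h and h', and the message loop for asynchronous tasks):

```python
from collections import deque

def two_rank_fixpoint(p0, k, kp):
    """p(X,Y) :- p(X,Z) & p(Z,Y) via k join tasks (hashed by h(Z))
    and kp dup-elim tasks (hashed by h'(X,Y))."""
    h = lambda v: hash(v) % k
    hp = lambda pair: hash(pair) % kp
    join = [set() for _ in range(k)]      # join-task states
    dup = [set() for _ in range(kp)]      # dup-elim-task states
    q = deque(("dup", t) for t in p0)     # initial tuples enter via dup-elim
    while q:
        dest, (a, b) = q.popleft()
        if dest == "dup":
            if (a, b) in dup[hp((a, b))]:
                continue                  # not truly new: drop
            dup[hp((a, b))].add((a, b))
            for n in {h(a), h(b)}:        # distribute new tuple to join tasks
                q.append((n, (a, b)))
        else:                             # dest is a join-task number
            n = dest
            produced = set()
            if h(b) == n:                 # (a,b) joins p(b,d) stored here
                produced |= {(a, d) for (b2, d) in join[n] if b2 == b}
            if h(a) == n:                 # (a,b) joins p(c,a) stored here
                produced |= {(c, b) for (c, a2) in join[n] if a2 == a}
            join[n].add((a, b))
            for t in produced:            # results go back through dup-elim
                q.append(("dup", t))
    return set().union(*dup)              # all tuples ever certified new
```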

Page 23

Example – Details

Each task writes “buffer” files locally, one for each of the tasks in the other rank.

The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.

Page 24

Example – Details – (2)

Periodically, each task writes its state (tuples received so far) incrementally and lets the master controller replicate it.

Problem: the controller can’t be too eager to pass output files along as input, or the files become tiny.

Page 25

Future Research

There is work to be done on optimization, using map-reduce or similar facilities, for restricted query languages such as Datalog, Datalog–, and Datalog + aggregation. Check out Hive and Pig, as well as work on multiway join optimization.

Page 26

Future Research – (2)

Almost everything is open about recursive Datalog implementation on map-reduce or similar systems: seminaïve evaluation in the general case; architectures for managing failures.

• Clustera and Hyracks are interesting examples of (nonrecursive) extensions of map-reduce.

When can we avoid communication, as with p(X,Y) :- e(X,Z) & p(Z,Y)?