Cluster Computing and Datalog Recursion Via Map-Reduce: Seminaïve Evaluation; Re-engineering Map-Reduce for Recursion

Upload: emery-emory-henderson

Post on 02-Jan-2016


TRANSCRIPT

Page 1

Cluster Computing and Datalog Recursion Via Map-Reduce

Seminaïve Evaluation

Re-engineering Map-Reduce for Recursion

Page 2

Acknowledgements

Joint work with Foto Afrati. Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.

Page 3

Implementing Datalog via Map-Reduce

Joins are straightforward to implement as a round of map-reduce.

Likewise, union/duplicate-elimination is a round of map-reduce.

But a recursion iterates such joins and unions, so its implementation can take many rounds of map-reduce.
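For concreteness, here is a minimal sequential sketch of one such round, a natural join, in Python (the code and names are illustrative, not part of the original slides): the map phase keys each tuple by the join attribute, and the reduce phase pairs up the matching tuples under each key.

```python
from collections import defaultdict

def mapreduce_join(r, s):
    """One map-reduce round computing q(A,C) :- r(A,B) & s(B,C).

    Map: key each tuple by its join attribute B.
    Reduce: for each key, combine every r-tuple with every s-tuple.
    """
    buckets = defaultdict(lambda: ([], []))     # B-value -> (A's from r, C's from s)
    for a, b in r:                              # "map" over r
        buckets[b][0].append(a)
    for b, c in s:                              # "map" over s
        buckets[b][1].append(c)
    out = set()                                 # set semantics: duplicates eliminated
    for _, (lefts, rights) in buckets.items():  # "reduce" per key
        for a in lefts:
            for c in rights:
                out.add((a, c))
    return out
```

Union with duplicate-elimination has the same shape: an identity map keyed by the whole tuple, and a reduce that keeps one copy per key.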

Page 4

Seminaïve Evaluation

Specific combination of joins and unions.

Example: chain rule q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

Let r, s, t = “old” relations; r’, s’, t’ = incremental relations.

Simplification: assume |r’| = a|r|, etc.

Page 5

A 3-Way Join Using Map-Reduce

q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z). Use k compute nodes. Give X and Y shares x and y, with xy = k, to determine the reduce task that gets each tuple.

The optimum strategy replicates r and t, not s, using communication |s| + 2√(k|r||t|).
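The arithmetic behind that claim can be sketched as follows (illustrative Python, not from the slides): each s-tuple knows both reduce keys and goes to one reducer, each r-tuple must be replicated across the y buckets of Y, and each t-tuple across the x buckets of X; minimizing y|r| + x|t| subject to xy = k balances the two terms.

```python
from math import sqrt

def optimal_shares(k, r_size, s_size, t_size):
    """Shares for q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) on k reduce
    tasks, keyed by (bucket of X, bucket of Y).

    s(X,Y) knows both keys: one reducer per tuple.
    r(W,X) knows only X: replicated to all y buckets of Y.
    t(Y,Z) knows only Y: replicated to all x buckets of X.
    """
    x = sqrt(k * r_size / t_size)             # share for X
    y = sqrt(k * t_size / r_size)             # share for Y; note x*y = k
    cost = s_size + y * r_size + x * t_size   # = |s| + 2*sqrt(k*|r|*|t|)
    return x, y, cost
```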

Page 6

Seminaïve Evaluation – (2)

Need to compute the sum (union) of seven terms (joins): rst' + rs't + r'st + rs't' + r'st' + r's't + r's't'.

Obvious method for computing a round of seminaïve evaluation: replicate r and r'; replicate t and t'; do not replicate s or s'. Communication = (1+a)(|s| + 2√(k|r||t|)).
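The seven terms are exactly the expansion of (r+r')(s+s')(t+t') with the all-old term rst removed; a quick Python check of that identity (illustrative code, not from the slides):

```python
def chain_join(r, s, t):
    """q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) over sets of pairs."""
    return {(w, z)
            for (w, x1) in r
            for (x2, y1) in s if x1 == x2
            for (y2, z) in t if y1 == y2}

def seminaive_round(r, dr, s, ds, t, dt):
    """Union of the seven delta joins: every old/new combination of
    the three relations except the all-old term r s t."""
    out = set()
    for a in (r, dr):
        for b in (s, ds):
            for c in (t, dt):
                if a is r and b is s and c is t:
                    continue              # rst was produced in earlier rounds
                out |= chain_join(a, b, c)
    return out
```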

Page 7

Seminaïve Evaluation – (3)

There are many other ways we might use k nodes to do the same task.

Example: one group of nodes does (r+r’)s’(t+t’); a second group does r’s(t+t’); the third group does rst’.

Theorem: no grouping does better than the obvious method for this example.

Page 8

Networks of Processes for Recursions

Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost?

Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.

Page 9

Example: Very Simple Recursion

p(X,Y) :- e(X,Z) & p(Z,Y)
p(X,Y) :- p0(X,Y)

Use k compute nodes. Hash Y-values to one of k buckets h(Y). Each node gets a complete copy of e. p0 is distributed among the k nodes, with p0(x,y) going to node h(y).

Page 10

Example – Continued

p(X,Y) :- e(X,Z) & p(Z,Y)

Each node applies the recursive rule and generates new tuples p(x,y). Key point: since each new tuple has a Y-value that hashes to the same node, no communication is necessary.

Duplicates are eliminated locally.
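A sequential simulation of the scheme makes the key point concrete (a sketch, not from the slides; the function name and Python's built-in hash stand in for real tasks and h): every derived tuple stays on the node that already owns its Y-bucket, so each node's loop runs with no messages at all.

```python
def reachability(e, p0, k):
    """p(X,Y) :- e(X,Z) & p(Z,Y);  p(X,Y) :- p0(X,Y),
    evaluated on k simulated nodes.

    Every node holds all of e; p0(x,y) is placed on node h(y).
    A derived tuple p(x,y) inherits the Y-value of the p-tuple it
    came from, hence lands on the same node: no communication.
    """
    h = lambda v: hash(v) % k
    nodes = [set() for _ in range(k)]
    for x, y in p0:
        nodes[h(y)].add((x, y))
    for node in nodes:                    # each node runs independently
        frontier = set(node)
        while frontier:                   # local seminaive iteration
            new = {(x, y)
                   for (x, z) in e
                   for (z2, y) in frontier if z == z2} - node
            node |= new                   # local duplicate elimination
            frontier = new
    return set().union(*nodes)
```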

Page 11

Harder Case of Recursion

Consider the recursive rule p(X,Y) :- p(X,Z) & p(Z,Y).

Responsibility is divided among compute nodes by hashing Z-values.

Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.

Page 12

Compute Node for h(Z) = n

The node for h(Z) = n receives every tuple p(a,b) with h(a) = n or h(b) = n. It remembers all received tuples (eliminating duplicates) and searches them for matches. Each tuple p(c,d) it produces is sent to the nodes for h(c) and h(d).
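A sketch of such a node as a Python class (the class and method names are my own, not from the slides): the set of remembered tuples serves both as the join state and as the duplicate filter, which is exactly what makes restarts after failures harmless later on.

```python
class JoinNode:
    """Node n of k for p(X,Y) :- p(X,Z) & p(Z,Y), hashed on Z."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.seen = set()                 # all tuples ever received

    def h(self, v):
        return hash(v) % self.k

    def receive(self, a, b):
        """Handle incoming p(a,b); return the set of tuples produced.

        Each produced p(c,d) would then be sent to nodes h(c) and h(d)."""
        if (a, b) in self.seen:
            return set()                  # duplicate: drop silently
        self.seen.add((a, b))
        out = set()
        if self.h(b) == self.n:           # (a,b) is a left operand: Z = b
            out |= {(a, d) for (b2, d) in self.seen if b2 == b}
        if self.h(a) == self.n:           # (a,b) is a right operand: Z = a
            out |= {(c, b) for (c, a2) in self.seen if a2 == a}
        return out
```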

Page 13

Comparison with Iteration

Advantage: Lets us avoid some communication of data that would be needed in iterated map-reduce rounds.

Disadvantage: Tasks run longer, so they are more likely to fail.

Page 14

Node Failures

To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes.

But recursions can’t work that way. What happens if a node fails after some of its output has been consumed?

Page 15

Node Failures – (2)

Actually, there is no problem! We restart the tasks of the failed node at another node.

The replacement task will send some data that the failed task also sent. But each node remembers tuples to eliminate duplicates anyway, so the repeated data is harmless.

Page 16

Node Failures – (3)

But the “no problem” conclusion is highly dependent on the Datalog assumption that we are computing sets.

The argument would fail if we were computing bags or aggregations of the tuples produced.

Similar problems arise for other recursions, e.g., PDEs.

Page 17

Extension of Map-Reduce Architecture for Recursion

Necessarily, all tasks need to operate in rounds.

The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.

Page 18

Extension – (2)

Suppose some task S fails, and it never supplies the round-(i+1) input to T.

A replacement S’ for S is restarted at some other node.

The master knows that T has received up to round i from S, so it ignores the first i output files from S’.

Page 19

Extension – (3)

The master knows where all the inputs ever received by S came from, so it can provide those inputs to S’.

Page 20

Checkpointing and State

Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere.

Tasks take input + state. Initially, state is empty.

Master can restart a task from some state and feed it only inputs received after that state was written.
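A toy model of that protocol (illustrative; a real system would replicate the snapshot to other nodes and the master would track which inputs postdate it): the state is the set of tuples received, a snapshot captures it, and a restart is the snapshot plus a replay of the inputs that arrived afterward.

```python
class CheckpointedTask:
    """Task whose state is the set of tuples received so far."""

    def __init__(self):
        self.state = set()

    def feed(self, batch):
        """Consume one input file (a batch of tuples)."""
        self.state |= set(batch)

    def snapshot(self):
        """State file the master would replicate elsewhere."""
        return set(self.state)

    @classmethod
    def restart(cls, snapshot, later_inputs):
        """Rebuild the task from a snapshot plus the inputs the
        master recorded as delivered after that snapshot."""
        task = cls()
        task.state = set(snapshot)
        for batch in later_inputs:
            task.feed(batch)
        return task
```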

Page 21

Example: Checkpointing

p(X,Y) :- p(X,Z) & p(Z,Y). Two groups of tasks:

1. Join tasks: hash on Z, using h(Z). Like tasks from previous example.

2. Eliminate-duplicates tasks: hash on X and Y, using h’(X,Y). Receives tuples from join tasks. Distributes truly new tuples to join tasks.

Page 22

Example – (2)

Join tasks: state has p(x,y) if h(x) or h(y) is right. Dup-elim tasks: state has p(x,y) if h’(x,y) is right.

A join task sends each produced tuple p(a,b) to the dup-elim task for h’(a,b); if the tuple is new, that task forwards it to the join tasks for h(a) and h(b).
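The whole two-rank network can be simulated sequentially with one message queue (a sketch under the slides' setup; Python's built-in hash stands in for both h and h', and the message loop for asynchronous tasks):

```python
from collections import deque

def two_rank_fixpoint(p0, k, kp):
    """p(X,Y) :- p(X,Z) & p(Z,Y) via k join tasks (hashed by h(Z))
    and kp dup-elim tasks (hashed by h'(X,Y))."""
    h = lambda v: hash(v) % k
    hp = lambda pair: hash(pair) % kp
    join = [set() for _ in range(k)]      # join-task states
    dup = [set() for _ in range(kp)]      # dup-elim-task states
    q = deque(("dup", t) for t in p0)     # initial tuples enter via dup-elim
    while q:
        dest, (a, b) = q.popleft()
        if dest == "dup":
            if (a, b) in dup[hp((a, b))]:
                continue                  # not truly new: drop
            dup[hp((a, b))].add((a, b))
            for n in {h(a), h(b)}:        # distribute new tuple to join tasks
                q.append((n, (a, b)))
        else:                             # dest is a join-task number
            n = dest
            produced = set()
            if h(b) == n:                 # (a,b) joins p(b,d) stored here
                produced |= {(a, d) for (b2, d) in join[n] if b2 == b}
            if h(a) == n:                 # (a,b) joins p(c,a) stored here
                produced |= {(c, b) for (c, a2) in join[n] if a2 == a}
            join[n].add((a, b))
            for t in produced:            # results go back through dup-elim
                q.append(("dup", t))
    return set().union(*dup)              # all tuples ever certified new
```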

Page 23

Example – Details

Each task writes “buffer” files locally, one for each of the tasks in the other rank.

The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.

Page 24

Example – Details – (2)

Periodically, each task writes its state (tuples received so far) incrementally and lets the master controller replicate it.

Problem: the controller can’t be too eager to pass output files along as input, or the files become tiny.

Page 25

Future Research

There is work to be done on optimization, using map-reduce or similar facilities, for restricted query languages such as Datalog, Datalog–, and Datalog + aggregation. Check out Hive and Pig, as well as work on multiway join optimization.

Page 26

Future Research – (2)

Almost everything is open about recursive Datalog implementation on map-reduce or similar systems: seminaïve evaluation in the general case; architectures for managing failures.

• Clustera and Hyracks are interesting examples of (nonrecursive) extensions of map-reduce.

When can we avoid communication, as with p(X,Y) :- e(X,Z) & p(Z,Y)?