# Communication Cost in Big Data Processing - MMDS (mmds-data.org/presentations/2014/suciu_)

Posted on 25-Aug-2018


TRANSCRIPT

Communication Cost in Big Data Processing

Dan Suciu, University of Washington

Joint work with Paul Beame and Paris Koutris, and the Database Group at UW

1

Queries on Big Data

Big Data processing on distributed clusters. General programs: MapReduce, Spark. SQL: Pig Latin, BigQuery, Shark, Myria.

Traditional data processing: main cost = disk I/O. Big Data processing: main cost = communication.

2

This Talk

How much communication is needed to solve a problem on p servers?

Problem: in this talk = full conjunctive query (CQ ⊆ SQL). Future work = extend to linear algebra, ML, etc.

Some techniques discussed in this talk are implemented in Myria (Bill Howe's talk).

3

Models of Communication

Combined communication/computation cost: PRAM; MRC [Karloff2010]; BSP [Valiant]; LogP model [Culler].

Explicit communication cost: [Hong&Kung81] red/blue pebble games, for I/O; communication complexity; [Ballard2012] extension to parallel algorithms; MapReduce model [DasSarma2013]; MPC [Beame2013, Beame2014] (this talk).

4

Outline

The MPC Model

Communication Cost for Triangles

Communication Cost for CQs

Discussion

5

Massively Parallel Communication Model (MPC)

Extends BSP [Valiant]

Input data = size M bits, uniformly partitioned: each of the servers 1, …, p initially holds O(M/p).

Number of servers = p

Algorithm = several rounds (Round 1, Round 2, Round 3, …)

One round = compute & communicate

Max communication load per server = L

Ideal: L = M/p. This talk: L = M/p * p^ε, where ε ∈ [0,1] is called the space exponent. Degenerate: L = M (send everything to one server).

10

Example: Q(x,y,z) = R(x,y) ⋈ S(y,z)

Input: R, S uniformly partitioned on p servers (M/p per server), size(R) + size(S) = M

Round 1: each server sends record R(x,y) to server h(y) mod p and sends record S(y,z) to server h(y) mod p; each server receives O(M/p).

Output: each server computes and outputs the local join R(x,y) ⋈ S(y,z)

Assuming no skew (∀y: degree_R(y), degree_S(y) ≤ M/p): Rounds = 1, load L = O(M/p) = O(M/p * p^0), i.e. space exponent ε = 0.

SQL is embarrassingly parallel!
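The one-round repartition join above can be simulated in a few lines of Python (a sketch only: servers are simulated as in-memory buckets, and names like `hash_join` are ours, not from the talk):

```python
from collections import defaultdict

def hash_join(R, S, p):
    """Simulate the one-round MPC join Q(x,y,z) = R(x,y) ⋈ S(y,z).

    Round 1: every tuple is sent to server hash(y) mod p, so all tuples
    sharing a y-value meet on the same server.
    Output: each server computes its local join independently.
    """
    servers_R = defaultdict(list)
    servers_S = defaultdict(list)
    for (x, y) in R:
        servers_R[hash(y) % p].append((x, y))
    for (y, z) in S:
        servers_S[hash(y) % p].append((y, z))

    answers = []
    for v in range(p):                 # each server; in parallel in reality
        local_S = defaultdict(list)
        for (y, z) in servers_S[v]:
            local_S[y].append(z)
        for (x, y) in servers_R[v]:
            for z in local_S[y]:
                answers.append((x, y, z))
    return answers

R = [(1, 'a'), (2, 'a'), (3, 'b')]
S = [('a', 10), ('b', 20), ('c', 30)]
print(sorted(hash_join(R, S, p=4)))   # [(1, 'a', 10), (2, 'a', 10), (3, 'b', 20)]
```

With uniform data each simulated server receives O(M/p) tuples, matching the slide's load analysis; a skewed y-value would concentrate its tuples on one bucket, which is exactly the skew caveat above.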

Questions

Fix a query Q and a number of servers p.

What is the communication load L needed to compute Q in one round? This talk, based on [Beame2013, Beame2014].

Fix a maximum load L (space exponent ε): how many rounds are needed for Q? Preliminary results in [Beame2013].

16

Outline

The MPC Model

Communication Cost for Triangles

Communication Cost for CQs

Discussion

17

The Triangle Query

Input: three tables R(X,Y), S(Y,Z), T(Z,X), size(R) + size(S) + size(T) = M

Output: compute Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x)

Example instance (each table holds the same four tuples):

R(X,Y) = S(Y,Z) = T(Z,X) = { (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice) }

18

Triangles in Two Rounds

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

Round 1: compute temp(X,Y,Z) = R(X,Y) ⋈ S(Y,Z)

Round 2: compute Q(X,Y,Z) = temp(X,Y,Z) ⋈ T(Z,X)

Load L depends on the size of temp, which can be much larger than M/p.

Triangles in One Round [Ganguli92, Afrati&Ullman10, Suri11, Beame13]

Factorize p = p^(1/3) * p^(1/3) * p^(1/3), and place the servers in a cube: each server is identified by its coordinates (i,j,k).

20

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

Triangles in One Round

Round 1: Send R(x,y) to all servers (h(x), h(y), *); send S(y,z) to all servers (*, h(y), h(z)); send T(z,x) to all servers (h(x), *, h(z)).

Output: compute locally R(x,y) ⋈ S(y,z) ⋈ T(z,x)

(The slide's figure traces the example tuples through the cube, e.g. with i = h(Fred) and j = h(Jim), the tuple R(Fred, Jim) is replicated to the line of servers (i, j, *).)

21
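The cube replication scheme can be sketched in Python (a simulation under our own naming, with p = side^3 servers kept as dictionary keys):

```python
from collections import defaultdict

def hypercube_triangles(R, S, T, side):
    """One-round triangle computation on p = side**3 simulated servers.

    R(x,y) goes to all servers (h(x), h(y), *); S(y,z) to (*, h(y), h(z));
    T(z,x) to (h(x), *, h(z)).  Every triangle (x,y,z) is then fully
    present on the single server (h(x), h(y), h(z)).
    """
    h = lambda v: hash(v) % side
    srv = defaultdict(lambda: ([], [], []))   # (local R, local S, local T)
    for (x, y) in R:
        for k in range(side):
            srv[(h(x), h(y), k)][0].append((x, y))
    for (y, z) in S:
        for i in range(side):
            srv[(i, h(y), h(z))][1].append((y, z))
    for (z, x) in T:
        for j in range(side):
            srv[(h(x), j, h(z))][2].append((z, x))

    out = set()
    for (lr, ls, lt) in srv.values():         # local join at each server
        ts = set(lt)
        for (x, y) in lr:
            for (y2, z) in ls:
                if y2 == y and (z, x) in ts:
                    out.add((x, y, z))
    return out

R = [(1, 2), (1, 3)]
S = [(2, 4), (3, 5)]
T = [(4, 1), (6, 1)]
print(hypercube_triangles(R, S, T, side=2))   # {(1, 2, 4)}
```

Each tuple is replicated to side = p^(1/3) servers, which is where the p^(1/3) replication factor, and hence the O(M/p^(2/3)) load on the next slide, comes from.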

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), M = size(R)+size(S)+size(T) (in bits)

Upper Bound is Lalgo = O(M/p^(2/3))

Theorem. Assume the data has no skew: all degrees are O(M/p^(1/3)). Then the algorithm computes Q in one round, with communication load Lalgo = O(M/p^(2/3)).

Compare to two rounds: no more large intermediate result; skew-free up to degrees O(M/p^(1/3)) (from O(M/p)); BUT: space exponent increased to ε = 1/3 (from ε = 0).

22

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), M = size(R)+size(S)+size(T) (in bits)

Lower Bound is Llower = M / p^(2/3)

Theorem.* Any one-round deterministic algorithm A for Q requires Llower bits of communication per server.

Assume that the three input relations R, S, T are stored on disjoint servers.#

# Without this assumption, one needs to add a multiplicative factor 3.
* Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013

25

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

Lower Bound is Llower = M / p^(2/3)

We prove a stronger claim, about any algorithm communicating < Llower bits. Let L be the algorithm's max communication load per server. We prove: if L < Llower then, over random permutations R, S, T, the algorithm returns at most a fraction (L / Llower)^(3/2) of the answers to Q.

Discussion: an algorithm with space exponent ε < 1/3 has load L = M/p^(1-ε) and reports only a (L / Llower)^(3/2) = 1/p^((1-3ε)/2) fraction of all answers. Fewer, as p increases!

The lower bound holds only for random inputs: it cannot hold for a fixed input R, S, T, since the algorithm may use a short bit encoding to signal this input to all servers.

By Yao's lemma, the result also holds for randomized algorithms, on some fixed input.

26

Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X) size(R)+size(S)+size(T)=M

[Balard2012] consider Strassens matrix multiplication algorithm and provethe following theorem for the CDAG model. If each server has M c n

2

p2/lg7

memory, then the communication load per server:

IO(n, p,M) =

npM

lg7 Mp

!

Thus, if M = n2

p2/lg7then IO(n, p) =

n2

p2/lg7

Lower Bound is Llower = M / p

Stronger than MPC: no restriction on rounds Weaker than MPC

Applies only to an algorithm, not to a problem CDAG model of computation v.s. bit-model 27

Q(X,Y,Z) = R(X,Y) S(Y,Z) T(Z,X)

M here same as n2 here

size(R)+size(S)+size(T)=M

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

Lower Bound is Llower = M / p^(2/3): Proof

Input: size = M = 3 log(n!) bits. R, S, T are random permutations on [n]: Pr[(i,j) ∈ R] = 1/n, and the same for S, T.

E[#answers to Q] = Σ_{i,j,k} Pr[(i,j,k) ∈ R ⋈ S ⋈ T] = n^3 * (1/n)^3 = 1

One server v ∈ [p] receives L(v) = L_R(v) + L_S(v) + L_T(v) bits. Denote a_ij = Pr[(i,j) ∈ R and v knows it]. Then (1) a_ij ≤ 1/n (obvious), and (2) Σ_{i,j} a_ij ≤ L_R(v) / log n (formal proof: entropy). Denote similarly b_jk, c_ki for S and T.

E[#answers to Q reported by server v]
  = Σ_{i,j,k} a_ij b_jk c_ki
  ≤ [ (Σ_{i,j} a_ij^2) * (Σ_{j,k} b_jk^2) * (Σ_{k,i} c_ki^2) ]^(1/2)   [Friedgut]
  ≤ (1/n)^(3/2) * [ L_R(v) * L_S(v) * L_T(v) / log^3 n ]^(1/2)         by (1), (2)
  ≤ (1/n)^(3/2) * [ L(v) / (3 log n) ]^(3/2)                           geometric/arithmetic mean
  ≤ [ L(v) / M ]^(3/2)                                                 expression for M above

E[#answers to Q reported by all servers] ≤ Σ_v [ L(v) / M ]^(3/2) ≤ p * [ L / M ]^(3/2) = [ L / Llower ]^(3/2)

32
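The proof's first step, E[#answers to Q] = 1 when R, S, T are random permutations, is easy to check empirically (a Monte Carlo sketch under our own naming):

```python
import random

def count_triangles(n):
    """Draw R, S, T as uniform random permutations on [n] and count
    answers to Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x)."""
    R = list(range(n)); random.shuffle(R)   # R = {(i, R[i])}, so Pr[(i,j) in R] = 1/n
    S = list(range(n)); random.shuffle(S)
    T = list(range(n)); random.shuffle(T)
    # (x,y,z) is an answer iff y = R[x], z = S[y], and x = T[z]
    return sum(1 for x in range(n) if T[S[R[x]]] == x)

random.seed(0)
trials = 20000
avg = sum(count_triangles(10) for _ in range(trials)) / trials
print(round(avg, 2))   # ≈ 1.0, matching E[#answers] = n^3 * (1/n)^3 = 1
```

The answer count equals the number of fixed points of the composed permutation T∘S∘R, whose expectation is exactly 1 for every n, which is why the average hovers near 1 regardless of n.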

Outline

The MPC Model

Communication Cost for Triangles

Communication Cost for CQs

Discussion

33

General Formula

Consider an arbitrary full conjunctive query (CQ):

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(S1) = M1, size(S2) = M2, …, size(Sℓ) = Mℓ

We will: give a lower bound formula Llower; give an algorithm with load Lalgo; prove that Llower = Lalgo.

34

Cartesian Product

Q(x,y) = S1(x) × S2(y), size(S1) = M1, size(S2) = M2

Algorithm: factorize p = p1 * p2. Send S1(x) to all servers (h(x), *); send S2(y) to all servers (*, h(y)). Load = M1/p1 + M2/p2, minimized when M1/p1 = M2/p2 = (M1 M2 / p)^(1/2), giving

Lalgo = 2 (M1 M2 / p)^(1/2)

Lower bound: let L(v) = L1(v) + L2(v) be the load at server v. The p servers must report all answers to Q:

M1 M2 ≤ Σ_v L1(v) * L2(v) ≤ Σ_v (L(v)/2)^2 ≤ p * (max_v L(v) / 2)^2

hence max_v L(v) ≥ Llower = 2 (M1 M2 / p)^(1/2).

37
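The balancing condition M1/p1 = M2/p2 can be found by brute force over the integer factorizations of p (a sketch; `best_grid` is our name, not from the talk):

```python
def best_grid(M1, M2, p):
    """Choose an integer factorization p = p1 * p2 minimizing the
    per-server load M1/p1 + M2/p2 of the grid (broadcast) algorithm."""
    best = None
    for p1 in range(1, p + 1):
        if p % p1:
            continue
        p2 = p // p1
        load = M1 / p1 + M2 / p2
        if best is None or load < best[0]:
            best = (load, p1, p2)
    return best

load, p1, p2 = best_grid(M1=10**6, M2=10**6, p=64)
print(p1, p2)   # 8 8: equal relation sizes give a square grid
print(load)     # 250000.0 = 2 * sqrt(M1*M2/p), matching Lalgo
```

Unequal sizes tilt the grid: the larger relation gets the larger share, exactly as the M1/p1 = M2/p2 condition dictates.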

A Simple Observation

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(S1) = M1, size(S2) = M2, …, size(Sℓ) = Mℓ

Definition. An edge packing for Q is a set of atoms with no common variables:

U = { Sj1(x̄j1), Sj2(x̄j2), …, Sj|U|(x̄j|U|) },  ∀i ≠ k: x̄ji ∩ x̄jk = ∅

Claim. Any one-round algorithm for Q must also compute the cartesian product:

Q'(x̄j1, …, x̄j|U|) = Sj1(x̄j1) × Sj2(x̄j2) × … × Sj|U|(x̄j|U|)

Proof: because the algorithm doesn't know the other Sj's. A proof similar to the cartesian product on the previous slide gives:

Llower = |U| * ( Mj1 Mj2 ⋯ Mj|U| / p )^(1/|U|)

Assume the input relations S1, S2, … are stored on disjoint servers#. #Otherwise, add another factor |U|.

The Lower Bound for CQ

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(S1) = M1, size(S2) = M2, …, size(Sℓ) = Mℓ

Definition. A fractional edge packing for Q is a set of real numbers u1, u2, …, uℓ such that:

∀j ∈ [ℓ]: uj ≥ 0
∀i ∈ [k]: Σ_{j ∈ [ℓ]: xi ∈ vars(Sj(x̄j))} uj ≤ 1

Theorem.* Any one-round deterministic algorithm A for Q requires Ω(Llower) bits of communication per server, where:

Llower = ( M1^u1 * M2^u2 * ⋯ * Mℓ^uℓ / p )^(1/(u1+u2+⋯+uℓ))

Note that we ignore constant factors (like |U| on the previous slide).

* Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS 2014

39

Triangle Query Revisited

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R) = M1, size(S) = M2, size(T) = M3

For each packing (u1, u2, u3), L = ( M1^u1 * M2^u2 * M3^u3 / p )^(1/(u1+u2+u3)):

(1/2, 1/2, 1/2) → (M1 M2 M3)^(1/3) / p^(2/3)
(1, 0, 0) → M1 / p
(0, 1, 0) → M2 / p
(0, 0, 1) → M3 / p

Llower = the largest of these four values. When M1 = M2 = M3 = M, the largest value is the first: M / p^(2/3).

40
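The table of packings can be reproduced numerically (a sketch under the assumed sizes M1 = M2 = M3 = 10^6 and p = 64; `packing_bound` is our name):

```python
def packing_bound(u, M, p):
    """L = (M1^u1 * M2^u2 * M3^u3 / p)^(1/(u1+u2+u3)) for a packing u."""
    u1, u2, u3 = u
    s = u1 + u2 + u3
    return (M[0]**u1 * M[1]**u2 * M[2]**u3 / p) ** (1.0 / s)

M, p = (10**6, 10**6, 10**6), 64
packings = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
bounds = {u: packing_bound(u, M, p) for u in packings}
# With equal sizes: M / p^(2/3) = 10^6 / 16 = 62500, versus M / p = 15625
print(max(bounds, key=bounds.get))   # (0.5, 0.5, 0.5)
```

The half-half-half packing dominates exactly when the three relations have comparable sizes; if one relation is tiny, one of the single-atom packings can take over, which is why Llower is the maximum over all packings.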

An Algorithm for CQ [Afrati&Ullman2010]

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(S1) = M1, size(S2) = M2, …, size(Sℓ) = Mℓ

Factorize p = p1 * p2 * ⋯ * pk; a server is a tuple (v1, …, vk) ∈ [p1] × ⋯ × [pk]. Send each tuple S1(x11, x12, …) to the servers whose coordinates agree with h(x11), h(x12), …; similarly for S2(x21, x22, …), etc. Compute the query locally at each server.

The load is O(Lalgo), where:

Lalgo = max_{j ∈ [ℓ]}  Mj / Π_{i ∈ [k]: xi ∈ vars(Sj(x̄j))} pi

[A&U] use sum instead of max. With max we can compute the optimal p1, p2, …, pk.

Llower = Lalgo

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(S1) = M1, size(S2) = M2, …, size(Sℓ) = Mℓ

Llower = ( Π_j Mj^uj / p )^(1/Σ_j uj)        Lalgo = max_j  Mj / Π_{i: xi ∈ vars(Sj)} pi

Apply log in base p. Denote μj = log_p Mj and ei = log_p pi. Minimizing Lalgo becomes a linear program (with λ = log_p Lalgo):

minimize λ
subject to  Σ_i ei ≤ 1
            ∀j: λ + Σ_{i: xi ∈ vars(Sj)} ei ≥ μj

Its dual is:

maximize  Σ_j fj μj − f0
subject to  Σ_j fj ≤ 1
            ∀i: Σ_{j: xi ∈ vars(Sj)} fj ≤ f0

Primal/dual correspondence: uj = fj / f0 and u0 = 1 / f0; at optimality u0 = Σ_j uj, which makes the optimal dual value equal to log_p Llower.
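For the symmetric triangle query the primal LP above can be checked by a brute-force grid search over the share exponents (a sketch; the helper names are ours, and we assume equal relation sizes with μj = 1, i.e. Mj = p):

```python
from itertools import product

def lp_objective(e, mu):
    """log_p(Lalgo) = max_j (mu_j - sum of shares of Sj's variables),
    for the triangle query with atoms on (x,y), (y,z), (z,x)."""
    e1, e2, e3 = e
    return max(mu - (e1 + e2), mu - (e2 + e3), mu - (e3 + e1))

mu = 1.0                                # all relations of size M = p
grid = [i / 300 for i in range(301)]    # exponents in steps of 1/300
best = min(
    ((lp_objective((e1, e2, 1 - e1 - e2), mu), e1, e2)
     for e1, e2 in product(grid, grid) if e1 + e2 <= 1),
    key=lambda t: t[0],
)
print(best)   # objective ≈ 1/3 at e1 = e2 = e3 = 1/3, i.e. L = M / p^(2/3)
```

The three constraint slacks always sum to 1 here, so the max is at least 1/3 with equality only at equal shares, recovering the p^(1/3) * p^(1/3) * p^(1/3) cube and the M/p^(2/3) load from the triangle section.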

Outline

The MPC Model

Communication Cost for Triangles

Communication Cost for CQs

Discussion

45

Discussion

Communication costs for multiple rounds: lower/upper bounds in [Beame13] connect the number of rounds to the log of the query diameter. Weaker communication model: the algorithm sends only tuples, not arbitrary bits.

Communication cost for skewed data: lower/upper bounds in [Beame14] have a gap of poly log p.

Communication costs beyond full CQs: open!

46