Communication Cost in Big Data Processing (MMDS 2014)
mmds-data.org/presentations/2014/suciu_

TRANSCRIPT

  • Communication Cost in Big Data Processing

    Dan Suciu University of Washington

    Joint work with Paul Beame and Paris Koutris

    and the Database Group at the UW


  • Queries on Big Data

    Big Data processing on distributed clusters. General programs: MapReduce, Spark. SQL: PigLatin, BigQuery, Shark, Myria.

    Traditional data processing: main cost = disk I/O. Big Data processing: main cost = communication.

  • This Talk

    How much communication is needed to solve a problem on p servers?

    Problem: in this talk = full conjunctive query, CQ (⊆ SQL). Future work = extend to linear algebra, ML, etc.

    Some techniques discussed in this talk are implemented in Myria (Bill Howe's talk).

  • Models of Communication

    Combined communication/computation cost: PRAM; MRC [Karloff2010]; BSP [Valiant]; LogP model [Culler]

    Explicit communication cost: red/blue pebble games for I/O [Hong&Kung81]; communication complexity, extended to parallel algorithms [Ballard2012]

    MapReduce model [DasSarma2013]; MPC [Beame2013, Beame2014] (this talk)

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • Massively Parallel Communication Model (MPC): extends BSP [Valiant]

    Input data = M bits, uniformly partitioned: each of the p servers initially holds O(M/p)

    Number of servers = p

    An algorithm runs in several rounds; one round = compute & communicate

    Max communication load per server, per round = L

    Ideal: L = M/p. This talk: L = (M/p) * p^ε, where ε ∈ [0,1] is called the space exponent. Degenerate: L = M (send everything to one server)

    [Figure: the p servers exchange data over rounds 1, 2, 3, ...; each server receives at most L bits per round]

  • Example: Q(x,y,z) = R(x,y) ⋈ S(y,z)

    Input: R, S uniformly partitioned on p servers (M/p per server); size(R)+size(S) = M

    Round 1: each server sends record R(x,y) to server h(y) mod p and sends record S(y,z) to server h(y) mod p

    Output: each server computes and outputs the local join R(x,y) ⋈ S(y,z)

    Assuming no skew (for all y: degree_R(y), degree_S(y) ≤ M/p): rounds = 1, load L = O(M/p) = O(M/p * p^0), i.e. space exponent ε = 0

    SQL is embarrassingly parallel!
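
    To make the round concrete, here is a minimal single-process sketch of this one-round hash join. The function name, the use of Python's built-in hash as h, and measuring load in tuples rather than bits are illustrative assumptions, not part of the talk:

        from collections import defaultdict

        def hash_join_one_round(R, S, p):
            """Q(x,y,z) = R(x,y) JOIN S(y,z) in one MPC round on p servers."""
            recv_R = defaultdict(list)   # tuples each server receives from R
            recv_S = defaultdict(list)   # tuples each server receives from S
            for (x, y) in R:             # route on the join variable y
                recv_R[hash(y) % p].append((x, y))
            for (y, z) in S:
                recv_S[hash(y) % p].append((y, z))

            # max communication load per server (in tuples, not bits)
            load = max(len(recv_R[v]) + len(recv_S[v]) for v in range(p))

            answers = []                 # each server joins locally
            for v in range(p):
                by_y = defaultdict(list)
                for (x, y) in recv_R[v]:
                    by_y[y].append(x)
                for (y, z) in recv_S[v]:
                    answers.extend((x, y, z) for x in by_y[y])
            return answers, load

    Without skew, each of the p buckets receives O(M/p) tuples, which is exactly the load L = O(M/p) claimed on the slide.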

  • Questions

    Fix a query Q and a number of servers p.

    What is the communication load L needed to compute Q in one round? This talk, based on [Beame2013, Beame2014].

    Fix a maximum load L (space exponent ε): how many rounds are needed for Q? Preliminary results in [Beame2013].

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • The Triangle Query

    Input: three tables R(X, Y), S(Y, Z), T(Z, X), with size(R)+size(S)+size(T) = M

    Output: compute Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x)

    Example instance (the same four tuples in each table):

    R(X, Y): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)
    S(Y, Z): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)
    T(Z, X): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)

  • Triangles in Two Rounds

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    Round 1: compute temp(X,Y,Z) = R(X,Y) ⋈ S(Y,Z)

    Round 2: compute Q(X,Y,Z) = temp(X,Y,Z) ⋈ T(Z,X)

    The load L depends on the size of temp, which can be much larger than M/p.

  • Triangles in One Round [Ganguli92, Afrati&Ullman10, Suri11, Beame13]

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    Factorize p = p^(1/3) * p^(1/3) * p^(1/3) and place the servers in a cube: each server is identified by its coordinates (i,j,k) ∈ [p^(1/3)]^3.

    Round 1 (h = a hash function with values in [p^(1/3)]):
    Send R(x,y) to all servers (h(x), h(y), *)
    Send S(y,z) to all servers (*, h(y), h(z))
    Send T(z,x) to all servers (h(x), *, h(z))

    Output: compute locally R(x,y) ⋈ S(y,z) ⋈ T(z,x)

    Example (from the figure): the tuple R(Fred, Jim) is sent to all p^(1/3) servers (i, j, *) with i = h(Fred), j = h(Jim); every tuple is replicated p^(1/3) times.
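
    A runnable sketch of the cube algorithm (Python's hash standing in for h; the names and data structures are illustrative, not from the talk):

        from collections import defaultdict

        def triangles_one_round(R, S, T, p_side):
            """One-round cube algorithm on p = p_side**3 servers."""
            h = lambda v: hash(v) % p_side
            recv = defaultdict(lambda: ([], [], []))  # (i,j,k) -> (R-, S-, T-tuples)
            for (x, y) in R:                          # R(x,y) -> (h(x), h(y), *)
                for k in range(p_side):
                    recv[(h(x), h(y), k)][0].append((x, y))
            for (y, z) in S:                          # S(y,z) -> (*, h(y), h(z))
                for i in range(p_side):
                    recv[(i, h(y), h(z))][1].append((y, z))
            for (z, x) in T:                          # T(z,x) -> (h(x), *, h(z))
                for j in range(p_side):
                    recv[(h(x), j, h(z))][2].append((z, x))

            answers = set()                           # local join at each server
            for Rv, Sv, Tv in recv.values():
                Tset = set(Tv)
                for (x, y) in Rv:
                    for (y2, z) in Sv:
                        if y == y2 and (z, x) in Tset:
                            answers.add((x, y, z))
            return answers

    Each tuple is replicated p_side = p^(1/3) times, which is where the O(M/p^(2/3)) load on the next slide comes from.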

  • Upper Bound is Lalgo = O(M/p^(2/3))

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
    M = size(R)+size(S)+size(T) (in bits)

    Theorem: Assume the data has no skew: all degrees are O(M/p^(1/3)). Then the algorithm computes Q in one round, with communication load Lalgo = O(M/p^(2/3)).

    Compare to two rounds: no more large intermediate result; skew-free up to degrees O(M/p^(1/3)) (from O(M/p)); BUT: the space exponent increased to ε = 1/3 (from ε = 0).

  • Lower Bound is Llower = M / p^(2/3)

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
    M = size(R)+size(S)+size(T) (in bits)

    Theorem* Any one-round deterministic algorithm A for Q requires Llower bits of communication per server. Assume that the three input relations R, S, T are stored on disjoint servers.#

    We prove a stronger claim, about any algorithm communicating < Llower bits. Let L be the algorithm's max communication load per server. We prove: if L < Llower then, over random permutations R, S, T, the algorithm returns at most a fraction (L / Llower)^(3/2) of the answers to Q.

    Discussion: An algorithm with space exponent ε < 1/3 has load L = M/p^(1-ε) and reports only a (L / Llower)^(3/2) = 1/p^((1-3ε)/2) fraction of all answers. Fewer, as p increases!

    The lower bound holds only for random inputs: it cannot hold for a fixed input R, S, T, since the algorithm may use a short bit encoding to signal this input to all servers.

    By Yao's lemma, the result also holds for randomized algorithms on some fixed input.

    # Without this assumption we need to add a multiplicative factor 3.
    * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013

  • Lower Bound is Llower = M / p^(2/3): Comparison with [Ballard2012]

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    [Ballard2012] consider Strassen's matrix multiplication algorithm and prove the following theorem for the CDAG model. If each server has M ≤ c * n^2 / p^(2/lg 7) memory, then the communication load per server is

        IO(n, p, M) = Ω( (n / M^(1/2))^(lg 7) * M / p )

    Thus, if M = n^2 / p^(2/lg 7), then IO(n, p) = Ω( n^2 / p^(2/lg 7) ). (The M of the MPC model plays the role of n^2 here.)

    Stronger than MPC: no restriction on the number of rounds.
    Weaker than MPC: applies only to an algorithm, not to a problem; CDAG model of computation vs. bit model.

  • Lower Bound is Llower = M / p^(2/3): Proof

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    Input: R, S, T are random permutations on [n], i.e. Pr[(i,j) ∈ R] = 1/n, and the same for S, T. Input size = M = 3 log n! bits.

    E[#answers to Q] = Σ_{i,j,k} Pr[(i,j,k) ∈ R⋈S⋈T] = n^3 * (1/n)^3 = 1

    Fix one server v ∈ [p]: it receives L(v) = L_R(v) + L_S(v) + L_T(v) bits. Denote a_ij = Pr[(i,j) ∈ R and v knows it]. Then:
    (1) a_ij ≤ 1/n (obvious)
    (2) Σ_{i,j} a_ij ≤ L_R(v) / log n (formal proof: entropy)
    Denote similarly b_jk, c_ki for S and T.

    E[#answers to Q reported by server v]
      = Σ_{i,j,k} a_ij b_jk c_ki
      ≤ [ (Σ_{i,j} a_ij^2) * (Σ_{j,k} b_jk^2) * (Σ_{k,i} c_ki^2) ]^(1/2)    [Friedgut]
      ≤ (1/n)^(3/2) * [ L_R(v) * L_S(v) * L_T(v) / log^3 n ]^(1/2)          by (1), (2)
      ≤ (1/n)^(3/2) * [ L(v) / (3 log n) ]^(3/2)                            geometric/arithmetic mean
      = [ L(v) / M ]^(3/2)                                                  expression for M above

    E[#answers to Q reported by all servers] ≤ Σ_v [ L(v) / M ]^(3/2) ≤ p * [ L / M ]^(3/2) = [ L / Llower ]^(3/2)
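
    The first expectation is easy to check empirically. A small sketch (the helper is hypothetical, not from the talk) that draws random permutations and counts triangles:

        import random

        def avg_triangles(n, trials=10000):
            """E[#answers] for random permutations R, S, T on [n]:
            (i,j) in R iff j = R[i], etc., so the answer (i,j,k) exists
            iff T[S[R[i]]] == i."""
            total = 0
            for _ in range(trials):
                R = random.sample(range(n), n)
                S = random.sample(range(n), n)
                T = random.sample(range(n), n)
                total += sum(1 for i in range(n) if T[S[R[i]]] == i)
            return total / trials    # tends to 1, independent of n

        print(avg_triangles(20))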

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • General Formula

    Consider an arbitrary full conjunctive query (CQ):

        Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ)
        size(S1) = M1, size(S2) = M2, ..., size(Sℓ) = Mℓ

    We will: give a lower bound formula Llower; give an algorithm with load Lalgo; prove that Llower = Lalgo.

  • Cartesian Product

    Q(x,y) = S1(x) × S2(y), size(S1) = M1, size(S2) = M2

    Algorithm: factorize p = p1 * p2 and arrange the servers in a p1 × p2 grid.
    Send S1(x) to all servers (h(x), *)
    Send S2(y) to all servers (*, h(y))
    Load = M1/p1 + M2/p2, minimized when M1/p1 = M2/p2 = (M1*M2/p)^(1/2). Hence:

        Lalgo = 2 * (M1*M2/p)^(1/2)

    Lower bound: let L(v) = L1(v) + L2(v) be the load at server v. The p servers must report all answers to Q, and server v can report at most L1(v)*L2(v) of them:

        M1*M2 ≤ Σ_v L1(v)*L2(v) ≤ Σ_v (L(v)/2)^2 ≤ p * (max_v L(v)/2)^2

    Hence:

        Llower = 2 * (M1*M2/p)^(1/2)
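
    A sketch of the share computation (the function name and example numbers are illustrative):

        import math

        def grid_shares(M1, M2, p):
            """Continuous optimum of M1/p1 + M2/p2 subject to p1*p2 = p."""
            p1 = math.sqrt(p * M1 / M2)       # balances M1/p1 = M2/p2
            p2 = p / p1
            return p1, p2, M1 / p1 + M2 / p2  # load = 2*sqrt(M1*M2/p)

        print(grid_shares(M1=10**6, M2=25 * 10**4, p=64))

    In practice p1 and p2 must be rounded to integers whose product is p, which changes the load only by a constant factor.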

  • A Simple Observation

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

    Definition: An edge packing for Q is a set of atoms, no two of which share a variable:

        U = { S_j1(x̄_j1), S_j2(x̄_j2), ..., S_j|U|(x̄_j|U|) },   ∀i, k: x̄_ji ∩ x̄_jk = ∅

    Claim: Any one-round algorithm for Q must also compute the cartesian product

        Q'(x̄_j1, ..., x̄_j|U|) = S_j1(x̄_j1) ⋈ S_j2(x̄_j2) ⋈ ... ⋈ S_j|U|(x̄_j|U|)

    Proof: because the algorithm doesn't know the other Sj's.

    Hence, by a proof similar to the cartesian product on the previous slide:

        Llower = |U| * ( M_j1 * M_j2 * ··· * M_j|U| / p )^(1/|U|)

    Assume that the input relations S1, S2, ... are stored on disjoint servers.#
    # Otherwise, add another factor |U|.

  • The Lower Bound for CQ

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

    Definition: A fractional edge packing is a set of real numbers u1, u2, ..., uℓ s.t.:

        ∀j ∈ [ℓ]: uj ≥ 0
        ∀i ∈ [k]: Σ_{j ∈ [ℓ]: xi ∈ vars(Sj(x̄j))} uj ≤ 1

    Theorem* Any one-round deterministic algorithm A for Q requires Ω(Llower) bits of communication per server, where

        Llower = ( M1^u1 * M2^u2 * ··· * Mℓ^uℓ / p )^(1/(u1+u2+···+uℓ))

    Note that we ignore constant factors (like |U| on the previous slide).

    * Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS 2014

  • Triangle Query Revisited

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R) = M1, size(S) = M2, size(T) = M3

    Packing (u1, u2, u3)    L = (M1^u1 * M2^u2 * M3^u3 / p)^(1/(u1+u2+u3))
    (1/2, 1/2, 1/2)         (M1*M2*M3)^(1/3) / p^(2/3)
    (1, 0, 0)               M1 / p
    (0, 1, 0)               M2 / p
    (0, 0, 1)               M3 / p

    Llower = the largest of these four values. When M1 = M2 = M3 = M, the largest value is the first: M / p^(2/3).

    [Figure: the triangle query drawn as a graph on the variables x, y, z]
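
    A sketch that evaluates the formula over the packings listed above (the function name and example numbers are illustrative):

        def l_lower(sizes, packings, p):
            """max over packings u of (prod_j Mj**uj / p) ** (1/sum_j uj)."""
            best = 0.0
            for u in packings:
                s = sum(u)
                if s == 0:
                    continue
                prod = 1.0
                for Mj, uj in zip(sizes, u):
                    prod *= Mj ** uj
                best = max(best, (prod / p) ** (1.0 / s))
            return best

        M = 2.0**20
        packings = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
        print(l_lower([M, M, M], packings, p=64))   # M / 64**(2/3) = 2**16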

  • An Algorithm for CQ [Afrati&Ullman2010]

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

    Factorize p = p1 * p2 * ··· * pk; a server is a point (v1, ..., vk) ∈ [p1] × ··· × [pk].
    Send S1(x11, x12, ...) to the servers whose coordinates agree with h(x11), h(x12), ...
    Send S2(x21, x22, ...) to the servers whose coordinates agree with h(x21), h(x22), ...
    Compute the query locally at each server.

    The load is O(Lalgo), where:

        Lalgo = max_{j ∈ [ℓ]} Mj / Π_{i ∈ [k]: xi ∈ vars(Sj(x̄j))} pi

    [A&U] use sum instead of max. With max we can compute the optimal p1, p2, ..., pk.
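
    A generic routing sketch in this style (Python's hash standing in for h; the data-structure conventions are illustrative assumptions):

        import itertools
        from collections import defaultdict

        def hypercube_route(atoms, shares, data):
            """atoms:  {relation: tuple of its variables}, e.g. {'R': ('x','y'), ...}
            shares: {variable: its share p_i}, with the product of shares = p
            data:   {relation: list of tuples}
            Returns {server coordinates: received tuples} and the max load."""
            variables = list(shares)
            recv = defaultdict(list)
            for rel, vs in atoms.items():
                for t in data[rel]:
                    fixed = {v: hash(a) % shares[v] for v, a in zip(vs, t)}
                    free = [v for v in variables if v not in fixed]
                    # replicate over every coordinate of the variables not in the atom
                    for combo in itertools.product(*(range(shares[v]) for v in free)):
                        fixed.update(zip(free, combo))
                        recv[tuple(fixed[v] for v in variables)].append((rel, t))
            return recv, max(map(len, recv.values()), default=0)

    For the triangle query with shares {x: 4, y: 4, z: 4} on p = 64 servers, this reproduces the cube algorithm from earlier.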

  • Llower = Lalgo

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

        Llower = ( Πj Mj^uj / p )^(1/Σj uj)
        Lalgo  = max_j  Mj / Π_{i: xi ∈ vars(Sj)} pi

    Apply log in base p, and denote μj = logp Mj, ei = logp pi. Computing the optimal shares p1, ..., pk is then a linear program:

        minimize λ
        s.t.  Σi ei ≤ 1
              ∀j: λ + Σ_{i: xi ∈ vars(Sj)} ei ≥ μj

    Its dual is:

        maximize ( Σj fj μj − f0 )
        s.t.  Σj fj ≤ 1
              ∀i: Σ_{j: xi ∈ vars(Sj)} fj ≤ f0

    Primal/dual correspondence: uj = fj / f0, u0 = 1 / f0; at optimality, u0 = Σj uj. Hence the optimal Lalgo equals the largest Llower over fractional edge packings.
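
    A sketch of the primal LP using scipy (scipy is an assumption here, not mentioned in the talk); on the triangle query it recovers the cube algorithm's shares:

        from math import log
        from scipy.optimize import linprog

        def optimal_shares(p, sizes, atom_vars, variables):
            """Minimize lambda s.t. sum_i e_i <= 1 and, for each atom j,
            lambda + sum_{i in vars(Sj)} e_i >= mu_j, with mu_j = log_p Mj."""
            k = len(variables)
            mu = [log(M, p) for M in sizes]
            c = [1.0] + [0.0] * k                      # objective: lambda
            A_ub = [[0.0] + [1.0] * k]                 # sum_i e_i <= 1
            b_ub = [1.0]
            for j, vs in enumerate(atom_vars):         # -lambda - sum e_i <= -mu_j
                A_ub.append([-1.0] + [-1.0 if v in vs else 0.0 for v in variables])
                b_ub.append(-mu[j])
            res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                          bounds=[(None, None)] + [(0, None)] * k)
            lam, e = res.x[0], res.x[1:]
            return p ** lam, {v: p ** ei for v, ei in zip(variables, e)}

        # Triangle, equal sizes: shares p_i = p**(1/3), L_algo = M / p**(2/3)
        print(optimal_shares(p=64, sizes=[2.0**20] * 3,
                             atom_vars=[{'x', 'y'}, {'y', 'z'}, {'z', 'x'}],
                             variables=['x', 'y', 'z']))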

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • Discussion

    Communication cost for multiple rounds: lower/upper bounds in [Beame13] connect the number of rounds to the log of the query's diameter, in a weaker communication model where the algorithm sends only tuples, not arbitrary bits.

    Communication cost for skewed data: lower/upper bounds in [Beame14] have a gap of poly log p.

    Communication cost beyond full CQs: open!