Communication Cost in Big Data Processing (MMDS 2014)
mmds-data.org/presentations/2014/suciu_

TRANSCRIPT

  • Communication Cost in Big Data Processing

    Dan Suciu University of Washington

    Joint work with Paul Beame and Paris Koutris

    and the Database Group at the UW


  • Queries on Big Data

    Big Data processing on distributed clusters. General programs: MapReduce, Spark. SQL: PigLatin, BigQuery, Shark, Myria.

    Traditional data processing: main cost = disk I/O. Big Data processing: main cost = communication.

  • This Talk

    How much communication is needed to solve a problem on p servers?

    Problem: in this talk = full conjunctive query, CQ (⊆ SQL). Future work = extend to linear algebra, ML, etc.

    Some techniques discussed in this talk are implemented in Myria (Bill Howe's talk).

  • Models of Communication

    Combined communication/computation cost: PRAM; MRC [Karloff2010]; BSP [Valiant]; LogP model [Culler]

    Explicit communication cost: red/blue pebble games for I/O [Hong&Kung81]; communication complexity, extended to parallel algorithms [Ballard2012]

    MapReduce model [DasSarma2013]; MPC [Beame2013, Beame2014] (this talk)

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • Massively Parallel Communication Model (MPC): extends BSP [Valiant]

    Input data = M bits, uniformly partitioned: each of the p servers initially holds O(M/p)

    Number of servers = p

    An algorithm runs in several rounds; one round = compute & communicate

    Max communication load per server, per round = L

    Ideal: L = M/p. This talk: L = (M/p) * p^ε, where ε ∈ [0,1] is called the space exponent. Degenerate: L = M (send everything to one server)

    [Figure: the p servers exchange data over rounds 1, 2, 3, ...; each server receives at most L bits per round]

  • Example: Q(x,y,z) = R(x,y) ⋈ S(y,z)

    Input: R, S uniformly partitioned on p servers (M/p per server); size(R)+size(S) = M

    Round 1: each server sends record R(x,y) to server h(y) mod p and sends record S(y,z) to server h(y) mod p

    Output: each server computes and outputs the local join R(x,y) ⋈ S(y,z)

    Assuming no skew (for all y: degree_R(y), degree_S(y) ≤ M/p): rounds = 1, load L = O(M/p) = O(M/p * p^0), i.e. space exponent ε = 0

    SQL is embarrassingly parallel!
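
    To make the round concrete, here is a minimal single-process sketch of this one-round hash join. The function name, the use of Python's built-in hash as h, and measuring load in tuples rather than bits are illustrative assumptions, not part of the talk:

        from collections import defaultdict

        def hash_join_one_round(R, S, p):
            """Q(x,y,z) = R(x,y) JOIN S(y,z) in one MPC round on p servers."""
            recv_R = defaultdict(list)   # tuples each server receives from R
            recv_S = defaultdict(list)   # tuples each server receives from S
            for (x, y) in R:             # route on the join variable y
                recv_R[hash(y) % p].append((x, y))
            for (y, z) in S:
                recv_S[hash(y) % p].append((y, z))

            # max communication load per server (in tuples, not bits)
            load = max(len(recv_R[v]) + len(recv_S[v]) for v in range(p))

            answers = []                 # each server joins locally
            for v in range(p):
                by_y = defaultdict(list)
                for (x, y) in recv_R[v]:
                    by_y[y].append(x)
                for (y, z) in recv_S[v]:
                    answers.extend((x, y, z) for x in by_y[y])
            return answers, load

    Without skew, each of the p buckets receives O(M/p) tuples, which is exactly the load L = O(M/p) claimed on the slide.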

  • Questions

    Fix a query Q and a number of servers p.

    What is the communication load L needed to compute Q in one round? This talk, based on [Beame2013, Beame2014].

    Fix a maximum load L (space exponent ε): how many rounds are needed for Q? Preliminary results in [Beame2013].

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • The Triangle Query

    Input: three tables R(X, Y), S(Y, Z), T(Z, X), with size(R)+size(S)+size(T) = M

    Output: compute Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x)

    Example instance (the same four tuples in each table):

    R(X, Y): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)
    S(Y, Z): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)
    T(Z, X): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)

  • Triangles in Two Rounds

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    Round 1: compute temp(X,Y,Z) = R(X,Y) ⋈ S(Y,Z)

    Round 2: compute Q(X,Y,Z) = temp(X,Y,Z) ⋈ T(Z,X)

    The load L depends on the size of temp, which can be much larger than M/p.

  • Triangles in One Round [Ganguli92, Afrati&Ullman10, Suri11, Beame13]

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    Factorize p = p^(1/3) * p^(1/3) * p^(1/3) and place the servers in a cube: each server is identified by its coordinates (i,j,k) ∈ [p^(1/3)]^3.

    Round 1 (h = a hash function with values in [p^(1/3)]):
    Send R(x,y) to all servers (h(x), h(y), *)
    Send S(y,z) to all servers (*, h(y), h(z))
    Send T(z,x) to all servers (h(x), *, h(z))

    Output: compute locally R(x,y) ⋈ S(y,z) ⋈ T(z,x)

    Example (from the figure): the tuple R(Fred, Jim) is sent to all p^(1/3) servers (i, j, *) with i = h(Fred), j = h(Jim); every tuple is replicated p^(1/3) times.
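
    A runnable sketch of the cube algorithm (Python's hash standing in for h; the names and data structures are illustrative, not from the talk):

        from collections import defaultdict

        def triangles_one_round(R, S, T, p_side):
            """One-round cube algorithm on p = p_side**3 servers."""
            h = lambda v: hash(v) % p_side
            recv = defaultdict(lambda: ([], [], []))  # (i,j,k) -> (R-, S-, T-tuples)
            for (x, y) in R:                          # R(x,y) -> (h(x), h(y), *)
                for k in range(p_side):
                    recv[(h(x), h(y), k)][0].append((x, y))
            for (y, z) in S:                          # S(y,z) -> (*, h(y), h(z))
                for i in range(p_side):
                    recv[(i, h(y), h(z))][1].append((y, z))
            for (z, x) in T:                          # T(z,x) -> (h(x), *, h(z))
                for j in range(p_side):
                    recv[(h(x), j, h(z))][2].append((z, x))

            answers = set()                           # local join at each server
            for Rv, Sv, Tv in recv.values():
                Tset = set(Tv)
                for (x, y) in Rv:
                    for (y2, z) in Sv:
                        if y == y2 and (z, x) in Tset:
                            answers.add((x, y, z))
            return answers

    Each tuple is replicated p_side = p^(1/3) times, which is where the O(M/p^(2/3)) load on the next slide comes from.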

  • Upper Bound is Lalgo = O(M/p^(2/3))

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
    M = size(R)+size(S)+size(T) (in bits)

    Theorem: Assume the data has no skew: all degrees are O(M/p^(1/3)). Then the algorithm computes Q in one round, with communication load Lalgo = O(M/p^(2/3)).

    Compare to two rounds: no more large intermediate result; skew-free up to degrees O(M/p^(1/3)) (from O(M/p)); BUT: the space exponent increased to ε = 1/3 (from ε = 0).

  • Lower Bound is Llower = M / p^(2/3)

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
    M = size(R)+size(S)+size(T) (in bits)

    Theorem* Any one-round deterministic algorithm A for Q requires Llower bits of communication per server. Assume that the three input relations R, S, T are stored on disjoint servers.#

    We prove a stronger claim, about any algorithm communicating < Llower bits. Let L be the algorithm's max communication load per server. We prove: if L < Llower then, over random permutations R, S, T, the algorithm returns at most a fraction (L / Llower)^(3/2) of the answers to Q.

    Discussion: An algorithm with space exponent ε < 1/3 has load L = M/p^(1-ε) and reports only a (L / Llower)^(3/2) = 1/p^((1-3ε)/2) fraction of all answers. Fewer, as p increases!

    The lower bound holds only for random inputs: it cannot hold for a fixed input R, S, T, since the algorithm may use a short bit encoding to signal this input to all servers.

    By Yao's lemma, the result also holds for randomized algorithms on some fixed input.

    # Without this assumption we need to add a multiplicative factor 3.
    * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013

  • Lower Bound is Llower = M / p^(2/3): Comparison with [Ballard2012]

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    [Ballard2012] consider Strassen's matrix multiplication algorithm and prove the following theorem for the CDAG model. If each server has M ≤ c * n^2 / p^(2/lg 7) memory, then the communication load per server is

        IO(n, p, M) = Ω( (n / M^(1/2))^(lg 7) * M / p )

    Thus, if M = n^2 / p^(2/lg 7), then IO(n, p) = Ω( n^2 / p^(2/lg 7) ). (The M of the MPC model plays the role of n^2 here.)

    Stronger than MPC: no restriction on the number of rounds.
    Weaker than MPC: applies only to an algorithm, not to a problem; CDAG model of computation vs. bit model.

  • Lower Bound is Llower = M / p^(2/3): Proof

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T) = M

    Input: R, S, T are random permutations on [n], i.e. Pr[(i,j) ∈ R] = 1/n, and the same for S, T. Input size = M = 3 log n! bits.

    E[#answers to Q] = Σ_{i,j,k} Pr[(i,j,k) ∈ R⋈S⋈T] = n^3 * (1/n)^3 = 1

    Fix one server v ∈ [p]: it receives L(v) = L_R(v) + L_S(v) + L_T(v) bits. Denote a_ij = Pr[(i,j) ∈ R and v knows it]. Then:
    (1) a_ij ≤ 1/n (obvious)
    (2) Σ_{i,j} a_ij ≤ L_R(v) / log n (formal proof: entropy)
    Denote similarly b_jk, c_ki for S and T.

    E[#answers to Q reported by server v]
      = Σ_{i,j,k} a_ij b_jk c_ki
      ≤ [ (Σ_{i,j} a_ij^2) * (Σ_{j,k} b_jk^2) * (Σ_{k,i} c_ki^2) ]^(1/2)    [Friedgut]
      ≤ (1/n)^(3/2) * [ L_R(v) * L_S(v) * L_T(v) / log^3 n ]^(1/2)          by (1), (2)
      ≤ (1/n)^(3/2) * [ L(v) / (3 log n) ]^(3/2)                            geometric/arithmetic mean
      = [ L(v) / M ]^(3/2)                                                  expression for M above

    E[#answers to Q reported by all servers] ≤ Σ_v [ L(v) / M ]^(3/2) ≤ p * [ L / M ]^(3/2) = [ L / Llower ]^(3/2)
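
    The first expectation is easy to check empirically. A small sketch (the helper is hypothetical, not from the talk) that draws random permutations and counts triangles:

        import random

        def avg_triangles(n, trials=10000):
            """E[#answers] for random permutations R, S, T on [n]:
            (i,j) in R iff j = R[i], etc., so the answer (i,j,k) exists
            iff T[S[R[i]]] == i."""
            total = 0
            for _ in range(trials):
                R = random.sample(range(n), n)
                S = random.sample(range(n), n)
                T = random.sample(range(n), n)
                total += sum(1 for i in range(n) if T[S[R[i]]] == i)
            return total / trials    # tends to 1, independent of n

        print(avg_triangles(20))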

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • General Formula

    Consider an arbitrary full conjunctive query (CQ):

        Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ)
        size(S1) = M1, size(S2) = M2, ..., size(Sℓ) = Mℓ

    We will: give a lower bound formula Llower; give an algorithm with load Lalgo; prove that Llower = Lalgo.

  • Cartesian Product

    Q(x,y) = S1(x) × S2(y), size(S1) = M1, size(S2) = M2

    Algorithm: factorize p = p1 * p2 and arrange the servers in a p1 × p2 grid.
    Send S1(x) to all servers (h(x), *)
    Send S2(y) to all servers (*, h(y))
    Load = M1/p1 + M2/p2, minimized when M1/p1 = M2/p2 = (M1*M2/p)^(1/2). Hence:

        Lalgo = 2 * (M1*M2/p)^(1/2)

    Lower bound: let L(v) = L1(v) + L2(v) be the load at server v. The p servers must report all answers to Q, and server v can report at most L1(v)*L2(v) of them:

        M1*M2 ≤ Σ_v L1(v)*L2(v) ≤ Σ_v (L(v)/2)^2 ≤ p * (max_v L(v)/2)^2

    Hence:

        Llower = 2 * (M1*M2/p)^(1/2)
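
    A sketch of the share computation (the function name and example numbers are illustrative):

        import math

        def grid_shares(M1, M2, p):
            """Continuous optimum of M1/p1 + M2/p2 subject to p1*p2 = p."""
            p1 = math.sqrt(p * M1 / M2)       # balances M1/p1 = M2/p2
            p2 = p / p1
            return p1, p2, M1 / p1 + M2 / p2  # load = 2*sqrt(M1*M2/p)

        print(grid_shares(M1=10**6, M2=25 * 10**4, p=64))

    In practice p1 and p2 must be rounded to integers whose product is p, which changes the load only by a constant factor.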

  • A Simple Observation

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

    Definition: An edge packing for Q is a set of atoms, no two of which share a variable:

        U = { S_j1(x̄_j1), S_j2(x̄_j2), ..., S_j|U|(x̄_j|U|) },   ∀i, k: x̄_ji ∩ x̄_jk = ∅

    Claim: Any one-round algorithm for Q must also compute the cartesian product

        Q'(x̄_j1, ..., x̄_j|U|) = S_j1(x̄_j1) ⋈ S_j2(x̄_j2) ⋈ ... ⋈ S_j|U|(x̄_j|U|)

    Proof: because the algorithm doesn't know the other Sj's.

    Hence, by a proof similar to the cartesian product on the previous slide:

        Llower = |U| * ( M_j1 * M_j2 * ··· * M_j|U| / p )^(1/|U|)

    Assume that the input relations S1, S2, ... are stored on disjoint servers.#
    # Otherwise, add another factor |U|.

  • The Lower Bound for CQ

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

    Definition: A fractional edge packing is a set of real numbers u1, u2, ..., uℓ s.t.:

        ∀j ∈ [ℓ]: uj ≥ 0
        ∀i ∈ [k]: Σ_{j ∈ [ℓ]: xi ∈ vars(Sj(x̄j))} uj ≤ 1

    Theorem* Any one-round deterministic algorithm A for Q requires Ω(Llower) bits of communication per server, where

        Llower = ( M1^u1 * M2^u2 * ··· * Mℓ^uℓ / p )^(1/(u1+u2+···+uℓ))

    Note that we ignore constant factors (like |U| on the previous slide).

    * Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS 2014

  • Triangle Query Revisited

    Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R) = M1, size(S) = M2, size(T) = M3

    Packing (u1, u2, u3)    L = (M1^u1 * M2^u2 * M3^u3 / p)^(1/(u1+u2+u3))
    (1/2, 1/2, 1/2)         (M1*M2*M3)^(1/3) / p^(2/3)
    (1, 0, 0)               M1 / p
    (0, 1, 0)               M2 / p
    (0, 0, 1)               M3 / p

    Llower = the largest of these four values. When M1 = M2 = M3 = M, the largest value is the first: M / p^(2/3).

    [Figure: the triangle query drawn as a graph on the variables x, y, z]
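
    A sketch that evaluates the formula over the packings listed above (the function name and example numbers are illustrative):

        def l_lower(sizes, packings, p):
            """max over packings u of (prod_j Mj**uj / p) ** (1/sum_j uj)."""
            best = 0.0
            for u in packings:
                s = sum(u)
                if s == 0:
                    continue
                prod = 1.0
                for Mj, uj in zip(sizes, u):
                    prod *= Mj ** uj
                best = max(best, (prod / p) ** (1.0 / s))
            return best

        M = 2.0**20
        packings = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
        print(l_lower([M, M, M], packings, p=64))   # M / 64**(2/3) = 2**16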

  • An Algorithm for CQ [Afrati&Ullman2010]

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

    Factorize p = p1 * p2 * ··· * pk; a server is a point (v1, ..., vk) ∈ [p1] × ··· × [pk].
    Send S1(x11, x12, ...) to the servers whose coordinates agree with h(x11), h(x12), ...
    Send S2(x21, x22, ...) to the servers whose coordinates agree with h(x21), h(x22), ...
    Compute the query locally at each server.

    The load is O(Lalgo), where:

        Lalgo = max_{j ∈ [ℓ]} Mj / Π_{i ∈ [k]: xi ∈ vars(Sj(x̄j))} pi

    [A&U] use sum instead of max. With max we can compute the optimal p1, p2, ..., pk.
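
    A generic routing sketch in this style (Python's hash standing in for h; the data-structure conventions are illustrative assumptions):

        import itertools
        from collections import defaultdict

        def hypercube_route(atoms, shares, data):
            """atoms:  {relation: tuple of its variables}, e.g. {'R': ('x','y'), ...}
            shares: {variable: its share p_i}, with the product of shares = p
            data:   {relation: list of tuples}
            Returns {server coordinates: received tuples} and the max load."""
            variables = list(shares)
            recv = defaultdict(list)
            for rel, vs in atoms.items():
                for t in data[rel]:
                    fixed = {v: hash(a) % shares[v] for v, a in zip(vs, t)}
                    free = [v for v in variables if v not in fixed]
                    # replicate over every coordinate of the variables not in the atom
                    for combo in itertools.product(*(range(shares[v]) for v in free)):
                        fixed.update(zip(free, combo))
                        recv[tuple(fixed[v] for v in variables)].append((rel, t))
            return recv, max(map(len, recv.values()), default=0)

    For the triangle query with shares {x: 4, y: 4, z: 4} on p = 64 servers, this reproduces the cube algorithm from earlier.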

  • Llower = Lalgo

    Q(x1, ..., xk) = S1(x̄1) ⋈ ... ⋈ Sℓ(x̄ℓ), size(S1) = M1, ..., size(Sℓ) = Mℓ

        Llower = ( Πj Mj^uj / p )^(1/Σj uj)
        Lalgo  = max_j  Mj / Π_{i: xi ∈ vars(Sj)} pi

    Apply log in base p, and denote μj = logp Mj, ei = logp pi. Computing the optimal shares p1, ..., pk is then a linear program:

        minimize λ
        s.t.  Σi ei ≤ 1
              ∀j: λ + Σ_{i: xi ∈ vars(Sj)} ei ≥ μj

    Its dual is:

        maximize ( Σj fj μj − f0 )
        s.t.  Σj fj ≤ 1
              ∀i: Σ_{j: xi ∈ vars(Sj)} fj ≤ f0

    Primal/dual correspondence: uj = fj / f0, u0 = 1 / f0; at optimality, u0 = Σj uj. Hence the optimal Lalgo equals the largest Llower over fractional edge packings.
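
    A sketch of the primal LP using scipy (scipy is an assumption here, not mentioned in the talk); on the triangle query it recovers the cube algorithm's shares:

        from math import log
        from scipy.optimize import linprog

        def optimal_shares(p, sizes, atom_vars, variables):
            """Minimize lambda s.t. sum_i e_i <= 1 and, for each atom j,
            lambda + sum_{i in vars(Sj)} e_i >= mu_j, with mu_j = log_p Mj."""
            k = len(variables)
            mu = [log(M, p) for M in sizes]
            c = [1.0] + [0.0] * k                      # objective: lambda
            A_ub = [[0.0] + [1.0] * k]                 # sum_i e_i <= 1
            b_ub = [1.0]
            for j, vs in enumerate(atom_vars):         # -lambda - sum e_i <= -mu_j
                A_ub.append([-1.0] + [-1.0 if v in vs else 0.0 for v in variables])
                b_ub.append(-mu[j])
            res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                          bounds=[(None, None)] + [(0, None)] * k)
            lam, e = res.x[0], res.x[1:]
            return p ** lam, {v: p ** ei for v, ei in zip(variables, e)}

        # Triangle, equal sizes: shares p_i = p**(1/3), L_algo = M / p**(2/3)
        print(optimal_shares(p=64, sizes=[2.0**20] * 3,
                             atom_vars=[{'x', 'y'}, {'y', 'z'}, {'z', 'x'}],
                             variables=['x', 'y', 'z']))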

  • Outline

    The MPC Model

    Communication Cost for Triangles

    Communication Cost for CQs

    Discussion


  • Discussion

    Communication cost for multiple rounds: lower/upper bounds in [Beame13] connect the number of rounds to the log of the query's diameter, in a weaker communication model where the algorithm sends only tuples, not arbitrary bits.

    Communication cost for skewed data: lower/upper bounds in [Beame14] have a gap of poly log p.

    Communication cost beyond full CQs: open!