Communication Cost in Big Data Processing - mmds mmds-data.org/presentations/2014/suciu_

TRANSCRIPT

<ul><li><p>Communication Cost in Big Data Processing </p><p>Dan Suciu, University of Washington </p><p>Joint work with Paul Beame and Paris Koutris </p><p>and the Database Group at the UW </p><p>1 </p></li><li><p>Queries on Big Data </p><p>Big Data processing on distributed clusters. General programs: MapReduce, Spark. SQL: Pig Latin, BigQuery, Shark, Myria </p><p>Traditional data processing: main cost = disk I/O. Big Data processing: main cost = communication </p><p>2 </p></li><li><p>This Talk How much communication is needed to solve </p><p>a problem on p servers? </p><p>Problem: in this talk = full conjunctive query CQ (a fragment of SQL). Future work = extend to linear algebra, ML, etc. </p><p>Some techniques discussed in this talk are implemented in Myria (Bill Howe's talk) </p><p>3 </p></li><li><p>Models of Communication </p><p>Combined communication/computation cost: PRAM; MRC [Karloff2010]; BSP [Valiant]; LogP Model [Culler] </p><p>Explicit communication cost: [Hong&amp;Kung81] red/blue pebble games, for I/O </p><p>communication complexity; [Ballard2012] extension to parallel algorithms </p><p>MapReduce model [DasSarma2013]; MPC [Beame2013, Beame2014] this talk </p><p>4 </p></li><li><p>Outline </p><p>The MPC Model </p><p>Communication Cost for Triangles </p><p>Communication Cost for CQs </p><p>Discussion </p><p>5 </p></li><li><p>Massively Parallel Communication Model (MPC). Extends BSP [Valiant] </p><p>Server 1 . . . Server p </p><p>Input (size=M) </p><p>Input data = size M bits </p><p>Number of servers = p </p><p>O(M/p) per server </p><p>6 </p></li><li><p>Massively Parallel Communication Model (MPC), continued </p><p>Round 1 </p><p>One round = Compute &amp; communicate </p><p>7 </p></li><li><p>Massively Parallel Communication Model (MPC), continued </p><p>Server 1 Server p . . . . 
</p><p>Input (size=M) </p><p>. . . . </p><p>Round 1 </p><p>Round 2 </p><p>Round 3 </p><p>. . . . </p><p>Algorithm = Several rounds </p><p>One round = Compute &amp; communicate </p><p>8 </p></li><li><p>Massively Parallel Communication Model (MPC), continued </p><p>Max communication load per server = L </p><p>9 </p></li><li><p>Massively Parallel Communication Model (MPC), continued </p><p>Max communication load per server = L </p><p>Ideal: L = M/p. This talk: L = M/p * p^ε, where ε ∈ [0,1] is called the space exponent. Degenerate: L = M (send everything to one server) </p><p>10 </p></li><li><p>Example: Q(x,y,z) = R(x,y) ⋈ S(y,z) </p><p>Server 1 . . . Server p, M/p each. Input: </p><p>R, S uniformly partitioned on p servers </p><p>size(R)+size(S)=M </p><p>11 </p></li><li><p>Example: Q(x,y,z) = R(x,y) ⋈ S(y,z) </p><p>Server 1 Server p . . . . 
</p><p>Round 1: O(M/p) per server </p><p>Round 1: each server sends record R(x,y) to server h(y) mod p and sends record S(y,z) to server h(y) mod p </p><p>Input: R, S uniformly </p><p>partitioned on p servers </p><p>size(R)+size(S)=M </p><p>12 </p></li><li><p>Example: Q(x,y,z) = R(x,y) ⋈ S(y,z) </p><p>Output: Each server computes and outputs </p><p>the local join R(x,y) ⋈ S(y,z) </p><p>13 </p></li><li><p>Example: Q(x,y,z) = R(x,y) ⋈ S(y,z) </p><p>Assuming no skew: ∀y, degree_R(y), degree_S(y) ≤ M/p. Rounds = 1. Load L = O(M/p) = O(M/p * p^0) </p><p>size(R)+size(S)=M </p><p>Space exponent ε = 0 </p></li><li><p>Example: Q(x,y,z) = R(x,y) ⋈ S(y,z) </p><p>SQL is embarrassingly </p><p>parallel! 
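The one-round hash-partitioned join on these slides can be sketched as follows (a minimal single-process simulation in assumed Python, not from the talk; the built-in `hash` stands in for the hash function h):

```python
from collections import defaultdict

def one_round_hash_join(R, S, p):
    """One-round distributed join Q(x,y,z) = R(x,y) JOIN S(y,z),
    simulated on one machine. Every tuple is routed by h(y), so
    matching tuples meet at the same server and each answer is
    produced exactly once."""
    at_server_R = defaultdict(list)
    at_server_S = defaultdict(list)
    for (x, y) in R:
        at_server_R[hash(y) % p].append((x, y))  # send R(x,y) to server h(y) mod p
    for (y, z) in S:
        at_server_S[hash(y) % p].append((y, z))  # send S(y,z) to server h(y) mod p
    out = []
    for v in range(p):                           # each server joins locally
        xs_by_y = defaultdict(list)
        for (x, y) in at_server_R[v]:
            xs_by_y[y].append(x)
        for (y, z) in at_server_S[v]:
            for x in xs_by_y[y]:
                out.append((x, y, z))
    return out
```

With no skew (every y has degree at most M/p in R and S), the expected load per server is O(M/p), i.e. space exponent ε = 0.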
</p></li><li><p>Questions </p><p>Fix a query Q and a number of servers p </p><p>What is the communication load L needed to compute Q in one round? This talk is based on [Beame2013, Beame2014] </p><p>Fix a maximum load L (space exponent ε). How many rounds are needed for Q? Preliminary results in [Beame2013] </p><p>16 </p></li><li><p>Outline </p><p>The MPC Model </p><p>Communication Cost for Triangles </p><p>Communication Cost for CQs </p><p>Discussion </p><p>17 </p></li><li><p>The Triangle Query </p><p>Input: three tables R(X, Y), S(Y, Z), T(Z, X), size(R)+size(S)+size(T)=M </p><p>Output: compute Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x) </p><p>[Example tables: R(X,Y), S(Y,Z), T(Z,X) each contain the rows (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice)] </p><p>18 </p></li><li><p>Triangles in Two Rounds </p><p>Round 1: compute temp(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) </p><p>Round 2: compute Q(X,Y,Z) = temp(X,Y,Z) ⋈ T(Z,X) </p><p>Load L depends on the size of temp. 
Can be much larger than M/p </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T)=M </p></li><li><p>Triangles in One Round </p><p>Factorize p = p^(1/3) * p^(1/3) * p^(1/3). Each server is identified by a triple (i,j,k) </p><p>Place the servers in a cube: server (i,j,k) </p><p>[Ganguli92, Afrati&amp;Ullman10, Suri11, Beame13] </p><p>20 </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T)=M </p></li><li><p>Triangles in One Round </p><p>[Figure: example tuples being routed, e.g. tuple R(Fred, Jim) with i = h(Fred), j = h(Jim) is sent to all servers (i, j, *)] </p><p>Round 1: Send R(x,y) to all servers (h(x), h(y), *). Send S(y,z) to all servers (*, h(y), h(z)). Send T(z,x) to all servers (h(x), *, h(z)) </p><p>Output: compute locally R(x,y) ⋈ S(y,z) ⋈ T(z,x) </p><p>21 </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T)=M </p></li><li><p>Upper Bound is Lalgo = O(M/p^(2/3)) </p><p>22 </p><p>Theorem Assume the data has no skew: all degrees are O(M/p^(1/3)). 
Then the algorithm computes Q in one round, with communication load Lalgo = O(M/p^(2/3)) </p><p>Compare to two rounds: no more large intermediate result; skew-free up to degrees = O(M/p^(1/3)) (from O(M/p)); BUT: space exponent increased to ε = 1/3 (from ε = 0) </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p>M = size(R)+size(S)+size(T) (in bits) </p></li><li><p>Lower Bound is Llower = M / p^(2/3) </p><p># without this assumption one needs to add a multiplicative factor of 3 * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013 </p><p>M = size(R)+size(S)+size(T) (in bits) </p><p>24 </p><p>Theorem* Any one-round deterministic algorithm A for Q requires Llower bits of communication per server </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p>Assume that the three input relations R, S, T are stored on disjoint servers# </p><p>size(R)+size(S)+size(T)=M </p></li><li><p>Lower Bound is Llower = M / p^(2/3) </p><p>25 </p><p>Theorem* Any one-round deterministic algorithm A for Q requires Llower bits of communication per server </p><p>We prove a stronger claim about any algorithm communicating &lt; Llower bits. Let L be the algorithm's max communication load per server. We prove: if L &lt; Llower then, over random permutations R, S, T, the algorithm returns at most a fraction (L / Llower)^(3/2) of the answers to Q </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p># without this assumption one needs to add a multiplicative factor of 3 * Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013 </p><p>Assume that the three input relations R, S, T are stored on disjoint servers# 
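The one-round cube algorithm for triangles from the slides above can be sketched as a single-process simulation (assumed Python, not from the talk; assumes p is a perfect cube and uses the built-in `hash` for h):

```python
from collections import defaultdict

def cube_triangles(R, S, T, p):
    """One-round triangle query Q = R(x,y), S(y,z), T(z,x) on a
    cube of p = s*s*s servers, s = p^(1/3).
    Routing: R(x,y) -> (h(x), h(y), *), S(y,z) -> (*, h(y), h(z)),
    T(z,x) -> (h(x), *, h(z)); each server then joins locally."""
    s = round(p ** (1 / 3))
    h = lambda v: hash(v) % s
    recv = defaultdict(lambda: ([], [], []))  # server (i,j,k) -> (R-, S-, T-tuples)
    for (x, y) in R:
        for k in range(s):                    # replicate along the third dimension
            recv[(h(x), h(y), k)][0].append((x, y))
    for (y, z) in S:
        for i in range(s):                    # replicate along the first dimension
            recv[(i, h(y), h(z))][1].append((y, z))
    for (z, x) in T:
        for j in range(s):                    # replicate along the second dimension
            recv[(h(x), j, h(z))][2].append((z, x))
    out = set()
    for Rv, Sv, Tv in recv.values():          # local join at each server
        Tset = set(Tv)
        for (x, y) in Rv:
            for (y2, z) in Sv:
                if y2 == y and (z, x) in Tset:
                    out.add((x, y, z))
    return out
```

Each answer (x,y,z) is found at exactly one server, (h(x), h(y), h(z)); each relation is replicated p^(1/3) times, which is where the O(M/p^(2/3)) per-server load comes from.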
</p><p>size(R)+size(S)+size(T)=M </p></li><li><p>Lower Bound is Llower = M / p^(2/3). Discussion: An algorithm with space exponent ε &lt; 1/3 has load </p><p>L = M/p^(1-ε) and reports only a (L / Llower)^(3/2) = 1/p^((1-3ε)/2) fraction of all answers. Fewer, as p increases! </p><p>The lower bound holds only for random inputs: it cannot hold for a fixed input R, S, T, since the algorithm may use a short bit encoding to signal this input to all servers </p><p>By Yao's lemma, the result also holds for randomized algorithms on some fixed input </p><p>26 </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R)+size(S)+size(T)=M </p></li><li><p>[Ballard2012] consider Strassen's matrix multiplication algorithm and prove the following theorem for the CDAG model. If each server has memory M ≤ c n^2 / p^(2/lg 7), then the communication load per server is IO(n, p, M) = Ω((n/√M)^(lg 7) * M/p). Thus, if M = n^2 / p^(2/lg 7) then IO(n, p) = Ω(n^2 / p^(2/lg 7)) </p><p>Lower Bound is Llower = M / p^(2/3) </p><p>Stronger than MPC: no restriction on rounds. Weaker than MPC: </p><p>applies only to an algorithm, not to a problem; CDAG model of computation vs. bit-model </p><p>27 </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p>M here plays the role of n^2 there </p><p>size(R)+size(S)+size(T)=M </p></li><li><p>Lower Bound is Llower = M / p^(2/3) </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p>Proof </p><p>28 </p><p>size(R)+size(S)+size(T)=M </p></li><li><p>Lower Bound is Llower = M / p^(2/3) </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p>Proof </p><p>E [#answers to Q] = Σ_{i,j,k} Pr [(i,j,k) ∈ R⋈S⋈T] = n^3 * (1/n)^3 = 1 </p><p>29 </p><p>Input: size = M = 3 log(n!) bits. 
R, S, T random permutations on [n]: Pr [(i,j) ∈ R] = 1/n, same for S, T </p><p>size(R)+size(S)+size(T)=M </p></li><li><p>Lower Bound is Llower = M / p^(2/3) </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) </p><p>Proof </p><p>E [#answers to Q] = Σ_{i,j,k} Pr [(i,j,k) ∈ R⋈S⋈T] = n^3 * (1/n)^3 = 1 </p><p>30 </p><p>One server v ∈ [p] receives L(v) = LR(v) + LS(v) + LT(v) bits. Denote aij = Pr [(i,j) ∈ R and v knows it] </p><p>Then (1) aij ≤ 1/n (obvious); (2) Σ_{i,j} aij ≤ LR(v) / log n (formal proof: entropy). Denote similarly bjk, cki for S and T </p><p>31 </p><p>E [#answers to Q reported by server v] = Σ_{i,j,k} aij bjk cki ≤ [(Σ_{i,j} aij^2) * (Σ_{j,k} bjk^2) * (Σ_{k,i} cki^2)]^(1/2) [Friedgut] ≤ (1/n)^(3/2) * [ LR(v) * LS(v) * LT(v) / log^3 n ]^(1/2) by (1), (2) ≤ (1/n)^(3/2) * [ L(v) / (3 log n) ]^(3/2) by the geometric/arithmetic mean inequality = [ L(v) / M ]^(3/2) by the expression for M above </p><p>E [#answers to Q reported by all servers] ≤ Σ_v [ L(v) / M ]^(3/2) ≤ p * [ L / M ]^(3/2) = [ L / Llower ]^(3/2) </p><p>32 </p><p>Input: size = M = 3 log(n!) bits. 
R, S, T random permutations on [n]: Pr [(i,j) ∈ R] = 1/n, same for S, T </p><p>size(R)+size(S)+size(T)=M </p></li><li><p>Outline </p><p>The MPC Model </p><p>Communication Cost for Triangles </p><p>Communication Cost for CQs </p><p>Discussion </p><p>33 </p></li><li><p>General Formula Consider an arbitrary conjunctive query (CQ). We will: give a lower bound formula Llower; give an algorithm with load Lalgo; prove that Llower = Lalgo </p><p>34 </p><p>Q(x1, . . . , xk) = S1(x1) ⋈ . . . ⋈ Sℓ(xℓ) </p><p>size(S1) = M1, size(S2) = M2, . . . 
, size(Sℓ) = Mℓ</p></li><li><p>Cartesian Product Q(x,y) = S1(x) × S2(y), size(S1)=M1, size(S2) = M2 </p></li><li><p>Cartesian Product </p><p>Q(x,y) = S1(x) × S2(y), size(S1)=M1, size(S2) = M2 </p><p>Lalgo = 2 (M1 M2 / p)^(1/2) </p><p>Algorithm: factorize p = p1 * p2. Send S1(x) to all servers (h(x), *) </p><p>Send S2(y) to all servers (*, h(y)). Load = M1 / p1 + M2 / p2, </p><p>minimized when M1 / p1 = M2 / p2 = (M1 M2 / p)^(1/2) </p></li><li><p>Cartesian Product </p><p>37 </p><p>Q(x,y) = S1(x) × S2(y), size(S1)=M1, size(S2) = M2 </p><p>Lalgo = 2 (M1 M2 / p)^(1/2), Llower = 2 (M1 M2 / p)^(1/2) </p><p>Lower bound: let L(v) = L1(v) + L2(v) = load at server v. The p servers must report all answers to Q: </p><p>M1 M2 ≤ Σ_v L1(v) * L2(v) ≤ Σ_v (L(v))^2 ≤ p max_v (L(v))^2 </p><p>Algorithm: factorize p = p1 * p2. Send S1(x) to all servers (h(x), *). Send S2(y) to all servers (*, h(y)). Load = M1 / p1 + M2 / p2, minimized when M1 / p1 = M2 / p2 = (M1 M2 / p)^(1/2) </p></li><li><p>A Simple Observation Q(x1, . . . , xk) = S1(x1) ⋈ . . . ⋈ Sℓ(xℓ), size(S1) = M1, size(S2) = M2, . . . , size(Sℓ) = Mℓ </p><p>Proof similar to the cartesian product on the previous slide </p><p>Definition An edge packing for Q is a set of atoms with no common variables: U = {Sj1(xj1), Sj2(xj2), . . . , Sj|U|(xj|U|)}, ∀i, k : xji ∩ xjk = ∅ </p><p>Claim Any one-round algorithm for Q must also compute the cartesian product Q'(xj1, . . . , xj|U|) = Sj1(xj1) ⋈ Sj2(xj2) ⋈ . . . ⋈ Sj|U|(xj|U|) </p><p>Proof Because the algorithm doesn't know the other Sj's. </p><p>Llower = |U| (Mj1 Mj2 · · · Mj|U| / p)^(1/|U|) </p><p>Assume that the input relations S1, S2, . . . are stored on disjoint servers# </p><p>#otherwise, add another factor |U| </p></li><li><p>The Lower Bound for CQ </p><p>39 </p><p>Q(x1, . . . , xk) = S1(x1) ⋈ . . . ⋈ Sℓ(xℓ) </p><p>size(S1) = M1, size(S2) = M2, . . . 
, size(Sℓ) = Mℓ</p><p>Theorem* Any one-round deterministic algorithm A for Q requires Ω(Llower) bits of communication per server: </p><p>Llower = (M1^u1 M2^u2 · · · Mℓ^uℓ / p)^(1/(u1+u2+···+uℓ)) </p><p>*Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS 2014 </p><p>Note that we ignore constant factors (like |U| on the previous slide) </p><p>Definition A fractional edge packing is a set of real numbers u1, u2, . . . , uℓ s.t.: ∀j ∈ [ℓ] : uj ≥ 0, and ∀i ∈ [k] : Σ_{j ∈ [ℓ] : xi ∈ vars(Sj(xj))} uj ≤ 1 </p></li><li><p>Triangle Query Revisited </p><p>40 </p><p>Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R) = M1, size(S) = M2, size(T) = M3 </p><p>Packing u1, u2, u3: L = (M1^u1 * M2^u2 * M3...</p></li></ul>
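As a worked check on the formula above (a sketch, not from the talk): for the triangle query with M1 = M2 = M3 = M/3, the packing u1 = u2 = u3 = 1/2 is feasible (each variable occurs in two atoms, so 1/2 + 1/2 ≤ 1) and gives Llower = ((M/3)^(3/2) / p)^(2/3) = (M/3) / p^(2/3), matching the triangle bound up to a constant. In (assumed) Python:

```python
def l_lower(sizes, u, p):
    """L_lower = (M1^u1 * ... * Ml^ul / p)^(1/(u1+...+ul))
    for a fractional edge packing u of the query."""
    prod = 1.0
    for M, uj in zip(sizes, u):
        prod *= M ** uj
    return (prod / p) ** (1.0 / sum(u))

def is_edge_packing(atom_vars, u):
    """Check the packing constraints: u_j >= 0 and, for every
    variable, the weights of the atoms containing it sum to <= 1."""
    if any(uj < 0 for uj in u):
        return False
    all_vars = set().union(*atom_vars)
    return all(
        sum(uj for vs, uj in zip(atom_vars, u) if x in vs) <= 1.0 + 1e-9
        for x in all_vars
    )

# Triangle query R(x,y), S(y,z), T(z,x):
triangle = [{"x", "y"}, {"y", "z"}, {"z", "x"}]
```

Maximizing Llower over all feasible packings u gives the strongest lower bound for the query.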
