
TRANSCRIPT

Page 1: final USC-2016 – alchem.usc.edu/ceng-seminar/slides/2016/rajiv_gupta.pdf (9/14/16)

Parallel Graph Processing on GPUs, Clusters, and Multicores

Farzad Khorasani Keval Vora

Rajiv Gupta

Graph Processing – Graphs & Analytics
Web Graphs: PageRank – rank websites in search results; Belief Propagation – detect malicious domains & infected hosts
Social Networks: Community Detection – common interests; Betweenness Centrality – critical nodes
Characteristics: Graphs – large number of nodes & edges; Algorithms – iterative, convergence-based
→ Exploit Data Parallelism (slide 1)

Parallel Graph Processing (slide 2)
GPUs (VWC, Medusa, Totem, etc.): + massive parallelism, power efficiency; – limited device memory & bandwidth
Clusters (GraphLab, GraphX, etc.): + scalability of memory & cores; – high communication latency
Multicores (GraphChi, X-Stream, etc.): + efficient parallelism; – limited memory, storage-system bound

Graphs on GPUs – suitable for data-parallel computations (slide 3)
NVIDIA GeForce GTX780: 12 Streaming Multiprocessors (SMs); 64 warps/SM; 32 threads/warp
Limitations: SIMD execution within warps; limited DRAM capacity – 3 GB; limited PCIe bandwidth – 12 GB/s

Challenges – Graphs on GPUs: low SIMD efficiency on irregular graphs (slide 4)
Pokec: 30M edges, 1.6M vertices; LiveJournal: 69M edges, 5M vertices
Virtual Warp Centric (VWC) [Hong et al., PPoPP 2011] → warp utilization ~40%

Challenges – Graphs on GPUs: scalability for large graphs (slide 5)
Limited DRAM capacity → multiple GPUs → limited PCIe bandwidth → low communication efficiency
Medusa, 4 GPUs [Zhong & He, IEEE TPDS 2014] → comm. efficiency 9–15%
Totem, 2 GPUs [Gharaibeh et al., PACT 2012] → comm. efficiency 11–18%

Page 2

Our Approach (slide 6)
Warp Segmentation: a variable number of threads is given to each vertex to process its incoming edges → high SIMD efficiency
Vertex Refinement: eliminates unnecessary communication → high communication efficiency

Iterative Graph Processing – Single Source Shortest Path (SSSP) (slide 7)

Figure (slide 7): example graph with vertices V0–V4, V0 the source; each iteration recomputes a vertex from its in-neighbors, e.g. V3 = min( V1 + 3, V2 + 5 ).

Iteration | V0 | V1 | V2 | V3 | V4
0         | 0  | ∞  | ∞  | ∞  | ∞
1         | 0  | 4  | 2  | ∞  | ∞
2         | 0  | 3  | 2  | 7  | 5
3         | 0  | 3  | 2  | 5  | 6
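The iteration above can be sketched as a simple relax-until-fixpoint loop. The edge list below is hypothetical (the slide's exact edge weights are not recoverable from the transcript):

```python
import math

def sssp(num_vertices, edges, source):
    """Iterative SSSP: relax every edge each round until no distance changes.
    `edges` is a list of (src, dst, weight) triples."""
    dist = [math.inf] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        for src, dst, w in edges:
            if dist[src] + w < dist[dst]:
                dist[dst] = dist[src] + w
                changed = True
    return dist

# Hypothetical 5-vertex graph, V0 the source.
edges = [(0, 1, 4), (0, 2, 2), (2, 1, 1), (1, 3, 3), (2, 4, 5), (3, 4, 1), (2, 3, 5)]
print(sssp(5, edges, 0))
```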

Compressed Sparse Row (CSR) Graph Representation (slide 8)
Edges a–h of the example graph are stored contiguously in E (neighbor indices), with PTR marking each vertex's segment:

E:   b c d e f g h a  →  0 4 2 0 1 2 2 4
PTR: 0 1 4 5 7 8
V:   V0 V1 V2 V3 V4  (indices 0–4)
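A minimal sketch of building the PTR/E layout from an edge list. The edge list is hypothetical, chosen only so that PTR comes out with the slide's shape 0 1 4 5 7 8:

```python
def build_csr(num_vertices, edges):
    """Build CSR (PTR, E) from (src, dst) pairs: vertex v's neighbors
    occupy E[PTR[v]:PTR[v+1]]."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        ptr[v + 1] = ptr[v] + counts[v]
    e = [0] * len(edges)
    cursor = ptr[:-1]          # next free slot per vertex
    for src, dst in edges:
        e[cursor[src]] = dst
        cursor[src] += 1
    return ptr, e

# Hypothetical edge list with the same out-degree profile as the slide.
ptr, e = build_csr(5, [(0, 1), (1, 0), (1, 2), (1, 4), (2, 3), (3, 2), (3, 4), (4, 0)])
print(ptr, e)
```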

Existing Works – Static (slide 9)
Each vertex is assigned the same (fixed) number of threads:
[Harish & Narayanan, HiPC 2007] – 1 thread; [Hong et al., PPoPP 2011] – a power-of-2 number of threads; [Kim & Batten, MICRO 2014] – a thread block
→ Leads to SIMD inefficiency

Figure (slide 10): with a static assignment of 4 threads/vertex (PTR = 0 6 8), V0's six edges N0–N5 need two rounds over lanes 0–3 while V1's two edges N6–N7 leave lanes idle – too few / too many threads per vertex.

Warp Segmentation (slide 11)
Figure: a warp is assigned to a vertex group; lanes 0–5 cover V0's edges N0–N5 and lanes 6–7 cover V1's edges N6–N7 (PTR = 0 6 8), so per-segment reductions proceed without idle rounds.

Page 3

Warp Segmentation (slide 12)
Each lane binary-searches its edge item's index in PTR to find the belonging vertex and its position inside that vertex's segment (PTR = 0 6 8: V0 owns N0–N5, V1 owns N6–N7):

Operation                       | N0    N1    N2    N3    N4    N5    | N6    N7
Binary search steps             | [0,8] [0,4] [0,2] [0,1] (N0–N5)     | [0,8] [0,4] [0,2] [1,2]
Belonging vertex index          | 0     0     0     0     0     0     | 1     1
Index inside segment from right | 5     4     3     2     1     0     | 1     0
Index inside segment from left  | 0     1     2     3     4     5     | 0     1
Segment size                    | 6     6     6     6     6     6     | 2     2

Warp Segmentation (slide 13)
For N3: the binary search [0,8] → [0,4] → [0,2] → [0,1] yields belonging vertex 0; index inside segment from left = 3 − 0 = 3; index from right = 6 − 3 − 1 = 2; segment size = 2 + 3 + 1 = 6.
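The per-lane computation above can be sketched on the host side; the slide's PTR = 0 6 8 example is reproduced below:

```python
import bisect

def lane_info(ptr, edge_idx):
    """For one warp lane handling edge item `edge_idx`, binary-search PTR
    to find the owning vertex and the lane's position in that segment."""
    v = bisect.bisect_right(ptr, edge_idx) - 1   # rightmost v with ptr[v] <= edge_idx
    from_left = edge_idx - ptr[v]
    size = ptr[v + 1] - ptr[v]
    from_right = size - from_left - 1
    return v, from_left, from_right, size

ptr = [0, 6, 8]           # V0 owns edges N0..N5, V1 owns N6..N7 (slide example)
print(lane_info(ptr, 3))  # N3 -> vertex 0, from_left 3, from_right 2, size 6
```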

Efficiency of Warp Segmentation (slide 14)
No shared-memory atomics for reduction; no synchronization primitives used; all memory accesses are coalesced except accesses to neighbors' values; exploits instruction-level parallelism.

Experimental Setup (slide 15)

NVIDIA GeForce GTX780. Programs: BFS (Breadth-First Search), CC (Connected Components), NN (Neural Network), PR (PageRank), SSSP (Single Source Shortest Path), SSWP (Single Source Widest Path).
Graphs (#V, #E): RM33V335E (33m, 335m); Orkut (3.07m, 234m); LiveJournal (4.85m, 69m); SocPokec (1.63m, 30.6m); RoadNetCA (1.97m, 5.5m); Amazon0312 (0.40m, 3.2m)

Speedups over VWC (slide 16)
Average per program: BFS 1.27x–2.60x; CC 1.33x–2.90x; NN 1.21x–2.70x; PR 1.22x–2.68x; SSSP 1.31x–2.76x; SSWP 1.28x–2.80x
Average per graph: RM33V335E 1.23x–1.56x; Orkut 1.15x–1.99x; LiveJournal 1.29x–1.99x; SocPokec 1.27x–1.77x; RoadNetCA 1.24x–9.90x; Amazon0312 1.53x–2.68x

Warp Execution Efficiency (slide 17)

Figure: warp execution efficiency (%) of SSSP under VWC-2/4/8/16/32 and Warp Segmentation on RM33V335E, ComOrkut, ER25V201E, RM25V201E, RM16V201E, RM16V134E, LiveJournal, SocPokec, HiggsTwitter, RoadNetCA, WebGoogle, and Amazon0312.

Page 4

Scalability – Multiple GPUs (slide 18)
Figure: example graph V0–V6 to be partitioned across devices. Bandwidth: PCIe ~12 GB/s vs. GPU DRAM ~300 GB/s.

Figure (slide 19): the example graph's CSR arrays (V, PTR, E) partitioned between Device 0 (V0, V1, V2, V5, V6) and Device 1 (V3, V4).

Scalability – Multiple GPUs: Medusa's ALL method [IEEE TPDS 2014] (slide 20)
Figure: the same partitioned graph under ALL → communication efficiency 9–15%.

Figure (slide 21): the same partitioned graph under Totem's Maximal Subset (MS) method [PACT 2012] → communication efficiency 11–18%.

Vertex Refinement (VR) (slide 22)
Offline Vertex Refinement: marks boundary vertices.
Online Vertex Refinement: performed on the fly by the CUDA kernel, looking for updates in values.

Online Vertex Refinement (slide 23)

Threads T#0–T#7 hold V0–V7; V0, V3, and V7 are updated & marked. A binary reduction counts them, and one thread reserves an outbox region:
A = atomicAdd( deviceOutboxMovingIndex, 3 );
A is shuffled to all lanes, and a binary prefix sum over the is-(updated & marked) flags gives each such vertex its slot: O[A+0] = V0, O[A+1] = V3, O[A+2] = V7. Device outbox: V0 V3 V7.
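A minimal host-side sketch of this compaction step, using a plain counter in place of the kernel's atomicAdd and a loop in place of the warp-wide prefix sum:

```python
def compact_to_outbox(values, flags, outbox, moving_index):
    """Warp-style stream compaction: count flagged lanes (binary reduction),
    reserve one contiguous outbox region (the kernel's single atomicAdd),
    then use a prefix sum over the flags to give each flagged lane a slot."""
    count = sum(flags)               # binary reduction across lanes
    base = moving_index[0]           # atomicAdd(deviceOutboxMovingIndex, count)
    moving_index[0] += count
    rank = 0                         # exclusive prefix sum of flags
    for value, flagged in zip(values, flags):
        if flagged:
            outbox[base + rank] = value
            rank += 1
    return outbox

outbox = [None] * 8
moving_index = [0]
# Lanes holding V0..V7; V0, V3, V7 are updated & marked, as on the slide.
flags = [1, 0, 0, 1, 0, 0, 0, 1]
compact_to_outbox([f"V{i}" for i in range(8)], flags, outbox, moving_index)
print(outbox[:3])  # ['V0', 'V3', 'V7']
```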

Page 5

Buffers – Inbox and Outbox; Double Buffering; Host-as-hub (slide 24)
Figure: each GPU (#0–#2) keeps an outbox of (values, indices) alongside its V, PTR, E arrays; outboxes travel over PCIe lanes to per-GPU inboxes (#0–#2) in host memory, with odd and even buffer copies enabling double buffering.

Speedup – VR vs. MS & ALL (slide 25)

Input graph (#V, #E)                       BFS    PR     SSSP
RM54V704E (54m, 704m), 3 GPUs – over ALL: 1.85x  1.48x  1.75x; over MS: 1.82x  1.47x  1.71x
RM41V536E (42m, 537m), 3 GPUs – over ALL: 1.89x  1.44x  1.80x; over MS: 1.82x  1.39x  1.75x
RM41V503E (42m, 503m), 2 GPUs – over ALL: 1.29x  1.18x  1.24x; over MS: 1.27x  1.15x  1.21x
RM35V402E (36m, 403m), 2 GPUs – over ALL: 1.32x  1.21x  1.25x; over MS: 1.30x  1.20x  1.22x

VR vs. MS & ALL (slide 26)
Figure: normalized aggregated processing time (computation vs. communication duration) of ALL, MS, and VR for BFS, PR, and SSSP on 3 GPUs.

Irregular Computations (slide 27)
Graphs on GPUs [PACT 2015a]: Warp Segmentation & Vertex Refinement
[HPDC 2014]: CuSha + CW graph representation
Hashing on GPUs [PACT 2015b]: Stadium Hashing vs. Cuckoo Hashing
Generalization and Automation: [MICRO 2015] Branch Divergence (CCC); [IPDPS 2016] Thread Assignment (CTE)

Graphs on Clusters – Scalability for Large Graphs (slide 28)
Partition the graph across machines; exploit memory and cores across all nodes.
Challenges: Programmability – distributed programs; Performance – network latency is high.

Our Approach (slide 29)
Programmability: ASPIRE Distributed Shared Memory (DSM).
Performance: a caching protocol to tolerate network latency.
On TARDIS (a 16-node cluster with InfiniBand), a remote fetch is 2.3x slower than a local one.
Example update: Fetch(c); Fetch(a); Fetch(b); c' = f(c, a, b); Store(c, c')

Page 6

Relaxed Consistency Protocol (RCP) (slide 30)
Allow the use of stale values to tolerate network latency. Staleness of a value = the number of updates performed since it was cached. Using stale values slows convergence → relax consistency without delaying convergence.

Bounded Staleness [Cui et al., USENIX ATC 2014] (slide 31): controls staleness using a static threshold.
Figure: iterations to convergence vs. number of remote fetches for high and low thresholds.

Relaxed Consistency Protocol (slide 32)
Set a high threshold: fetches are allowed to use stale values → minimize the impact of network latency.
Best-effort refresh: minimize the use of stale values and the staleness of the stale values that are used.

Relaxed Consistency Protocol (slide 33)
Cache-miss: value not in cache.
Current-hit: value in cache; staleness = 0.
Stale-hit: value in cache; 0 < staleness ≤ t; issue a refresh request.
Stale-miss: value in cache; staleness > t.

Relaxed Consistency Protocol – state machine (slides 34–38)
States: Uncached, Shared, Stale. Transitions:
Uncached → Shared: cache-miss / write [local node], staleness = 0
Shared → Shared: hit / write [local node]
Shared → Stale: invalidate [directory], ++staleness
Stale → Stale: stale-hit [local node]; invalidate [directory], ++staleness
Stale → Shared: refresh [local node] or stale-miss [local node], staleness = 0
Shared → Uncached and Stale → Uncached: evict [local node]

Experimental Setup (slide 39)
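The access classification behind the state machine can be sketched as a small lookup function; the cache contents below are hypothetical:

```python
def classify(cache, key, t):
    """Classify a fetch under RCP: cache-miss, current-hit, stale-hit
    (value usable, refresh issued), or stale-miss (too stale, must block).
    `cache` maps key -> staleness (updates seen since the value was cached)."""
    if key not in cache:
        return "cache-miss"
    staleness = cache[key]
    if staleness == 0:
        return "current-hit"
    if staleness <= t:
        return "stale-hit"       # use the stale value, issue a refresh request
    return "stale-miss"          # block and fetch a fresh copy

cache = {"a": 0, "b": 2, "c": 5}   # hypothetical staleness counters
print([classify(cache, k, 3) for k in ("a", "b", "c", "d")])
```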

TARDIS – 16-node cluster.
Programs: SSSP (Single Source Shortest Path), CC (Connected Components), CD (Community Detection), PR (PageRank), GC (Graph Coloring).
Graphs (#V, #E): Orkut (3.07m, 234m); LiveJournal (4.85m, 69m); Pokec (1.63m, 30.6m); HiggsTwitter (0.46m, 14.8m); RoadNetCA (1.97m, 5.5m); RoadNetTX (1.38m, 3.84m)

Execution Time: RCP over SCP (slide 40)
Graph        | CD    CC    GC    PR    SSSP
Orkut        | 1.57x 2.60x 2.14x 1.55x 2.43x
LiveJournal  | 3.71x 3.30x 2.47x 1.83x 2.39x
Pokec        | 2.40x 2.04x 2.26x 6.42x 2.09x
HiggsTwitter | 1.94x 3.34x 2.05x 6.78x 1.25x
RoadNetCA    | 1.35x 1.13x 0.56x 1.02x 1.03x
RoadNetTX    | 0.94x 0.94x 0.88x 0.87x 0.98x
RCP is over 2x faster for graph algorithms.

Remote Fetches (slide 41): RCP blocks on 42% of fetches; the best Stale-n policy blocks on 86%.

Page 8: final USC-2016alchem.usc.edu/ceng-seminar/slides/2016/rajiv_gupta.pdf[MICRO 2015] Branch Divergence (CCC) [IPDPS 2016] Thread Assignment (CTE) 27 Graphs on Clusters Scalability for

9/14/16

8

Iterations for Convergence (slide 42): RCP requires 49% more iterations; Stale-2 and Stale-3 require 146% and 176% more.
Staleness Percentage (slide 43): 97% of the values used have staleness 0; 2% have staleness 1.

RCP compares well with GraphLab and is orthogonal to it. (slide 44)
RCP over GraphLab [VLDB'12]:
Graph        | SSSP  PR    CC
Orkut        | 1.48x 1.01x 1.13x
LiveJournal  | 0.72x 0.86x 3.03x
Pokec        | 0.92x 0.94x 5.71x
HiggsTwitter | 2.20x --    3.95x
RoadNetCA    | 1.22x 11.5x 3.88x
RoadNetTX    | 0.41x 15.5x 2.27x

Vora, Koduru, & Gupta [OOPSLA 2014]: Exploiting Asynchronous Parallelism in Iterative Algorithms using Relaxed Consistency based DSM.
Other Work – Graphs on Clusters (slide 45): Evolving Graphs [TACO 2016]; Streaming Graphs; Confined Recovery.

Out-of-core Graph Processing (slide 46)
GraphChi [OSDI'12] – disk-friendly: the graph is split across multiple shards, created during pre-processing, which remain static throughout processing.
Load and process one shard at a time. Iterative processing is I/O bound – GraphChi spends 73–88% of its time on loading.

Opportunity (slide 47): not all edges are always required.
Figure: % useful edges vs. iteration for PageRank on the LJ, NF, UK, TT, and FT graphs.

Page 9

Challenges (slide 48)
Different algorithms behave differently → need dynamic partitions.
How to create dynamic partitions on the fly? How to process dynamic partitions?
Figure: ideal shard size (normalized) vs. iteration for PR, MSSP, and CC.

Shards – maintain locality while processing (slide 49)
Example graph over vertices a–e, partitioned by destination:
Shard 0: (a,b,e0) (a,c,e1) (d,a,e2) (e,c,e3)
Shard 1: (b,d,e4) (d,e,e5) (e,d,e6)
Load a shard into main memory, compute, and store it back, one shard at a time.

Page 10

Dynamic Shards (slide 50)
Over time, only the still-needed edges are kept:
Shard 0: (a,b,5) (a,c,4) (d,a,7) (e,c,9) → (a,b,4) (a,c,6) (e,c,8) → (a,b,2) (a,c,3)
Shard 1: (b,d,6) (d,e,4) (e,d,1) → (b,d,3) (e,d,6) → (e,d,9)
Eventually only 3 out of 7 edges remain. How to efficiently create dynamic shards on the fly?
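A minimal sketch of shrinking a shard to its still-needed edges. The usefulness criterion here (keep edges whose source changed) is an assumed example, not the paper's exact predicate:

```python
def next_dynamic_shard(shard, useful):
    """Create the next-generation dynamic shard by streaming the current
    shard and keeping only the edges marked useful during processing."""
    return [edge for edge in shard if useful(edge)]

# Slide's Shard 0; suppose only the values of vertices a and e changed.
shard0 = [("a", "b", 5), ("a", "c", 4), ("d", "a", 7), ("e", "c", 9)]
active_sources = {"a", "e"}   # hypothetical set of changed vertices
print(next_dynamic_shard(shard0, lambda edge: edge[0] in active_sources))
```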

Creating Dynamic Shards (slide 51)
While a shard is processed in main memory, record which edges were actually used; when storing the shard back, write out only those edges as the next-generation dynamic shard, e.g. Shard 0: (a,b,5) (a,c,4) (d,a,7) (e,c,9) → (a,b,4) (a,c,6) (e,c,8) → (a,b,2) (a,c,3) → empty.

Page 11

Creation of dynamic shards uses sequential writes and is lightweight. (slide 51)

Processing Dynamic Shards (slide 52)

Dynamic shards are loaded and processed like static ones, but some vertices now have edges missing from the shard. How to process vertices with missing edges?

Delay-based Processing (slide 53)
When a vertex's computation needs edges that are missing from the dynamic shard (e.g. the delayed edge (b,d,5)), delay that computation rather than loading the full shard.

Page 12

Delay-based Processing (slide 54)
Delayed computations are held in an in-memory buffer; delayed vertices are processed periodically in a shadow iteration, in which all edges are made available.
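A minimal sketch of delaying vertices whose edges are missing. The coverage test (compare available in-edges against the full in-degree) and the min-style update are assumed for illustration:

```python
def process_shard(dynamic_shard, full_in_degree, delay_buffer):
    """Process a dynamic shard; a vertex whose available in-edges do not
    cover its full in-degree cannot be finalized and is delayed instead."""
    available = {}
    for src, dst, value in dynamic_shard:
        available.setdefault(dst, []).append((src, value))
    results = {}
    for dst, in_edges in available.items():
        if len(in_edges) < full_in_degree[dst]:
            delay_buffer.add(dst)        # handled later in a shadow iteration
        else:
            results[dst] = min(v for _, v in in_edges)   # e.g. SSSP-style min
    return results

delayed = set()
shard = [("a", "b", 4), ("a", "c", 6), ("e", "c", 8)]   # dynamic shard, (d,a) dropped
in_degree = {"b": 2, "c": 2}                            # hypothetical full in-degrees
print(process_shard(shard, in_degree, delayed), delayed)
```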

Shadow Iteration (slide 55)
In a shadow iteration, the delayed computations (e.g. the buffered edge (b,d,5)) are processed with all of their edges made available, and the results are merged back into memory.

Page 13


Page 14

Dynamic Shards (slide 56)
Static shards S0, S1, …, Sn-1 produce dynamic shards DS0_0 … DS0_n-1; generation i's shards DSi_0 … DSi_n-1 are rebuilt into generation i+1 by a shadow iteration. 64–73% of active vertices get delayed.

Accumulation-based Processing (slides 57–58)
Figure: before, a vertex with any missing in-edge was fully delayed; with accumulation-based processing, contributions from the available edges are accumulated immediately, so a vertex is either not delayed at all or only partially delayed.

Evaluation Setup (slide 59)
Dell 8-core machine with 8 GB main memory; Dell 500 GB 7.2K RPM HDD and Dell 400 GB SSD; Ubuntu 14.04; file-system caches flushed.

Page 15

Benchmarks & Inputs (slide 60)
5 graph algorithms: PageRank, Belief Propagation, Heat Simulation, Multiple Source Shortest Paths, Connected Components. 6 real-world input graphs.

Performance (slide 61)

1.8x average speedup.
Figure: speedup of CC, MSSP, HS, BP, and PR on the UK, TT, and FT inputs.

Dynamic Shard Size (slide 62)

Figure: normalized shard size vs. iteration for PR on FT, HS on FT, and BP on FT.

Reads & Writes (slide 63)
Up to 64% reduction in data read and up to 54% reduction in data written; shadow iterations are I/O intensive.
Figure: normalized read size (regular vs. shadow reads) and write size (regular vs. shadow writes) for PR, BP, and HS on UK, TT, and FT.

Vora, Xu, & Gupta [USENIX ATC 2016]: Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing.
Other Work – Graphs on Multicores (slide 64): Graph Reduction [HPDC 2016]; Out-of-core version.

Page 16

Summary – Graph Processing (slide 65)
On GPUs: SIMD efficiency – keeping threads busy; Communication – efficient use of PCIe bandwidth.
On Clusters: Communication – tolerating network latency.
On Multicores: I/O efficiency – transforming the graph representation.