
TRANSCRIPT

Page 1: final USC-2016 – alchem.usc.edu/ceng-seminar/slides/2016/rajiv_gupta.pdf (9/14/16)

Parallel Graph Processing on GPUs, Clusters, and Multicores

Farzad Khorasani Keval Vora

Rajiv Gupta

Graph Processing – Graphs & Analytics
Web Graphs: PageRank – rank websites in search results; Belief Propagation – detect malicious domains & infected hosts
Social Networks: Community Detection – common interests; Betweenness Centrality – critical nodes
Characteristics: Graphs – large number of nodes & edges; Algorithms – iterative, convergence-based
→ Exploit Data Parallelism (slide 1)

Parallel Graph Processing (slide 2)
GPUs (VWC, Medusa, Totem, etc.): + massive parallelism, power efficiency; – limited device memory & bandwidth
Clusters (GraphLab, GraphX, etc.): + scalability of memory & cores; – high communication latency
Multicores (GraphChi, X-Stream, etc.): + efficient parallelism; – limited memory, storage-system bound

Graphs on GPUs – suitable for data-parallel computations (slide 3)
NVIDIA GeForce GTX780: 12 Streaming Multiprocessors (SMs); 64 warps/SM; 32 threads/warp
Limitations: SIMD execution within warps; limited DRAM capacity – 3 GB; limited PCIe bandwidth – 12 GB/s

Challenges – Graphs on GPUs: low SIMD efficiency on irregular graphs (slide 4)
Pokec: 30M edges, 1.6M vertices; LiveJournal: 69M edges, 5M vertices
Virtual Warp Centric (VWC) [Hong et al., PPoPP 2011] → warp utilization ~40%

Challenges – Graphs on GPUs: scalability for large graphs (slide 5)
Limited DRAM capacity → multiple GPUs → limited PCIe bandwidth → low communication efficiency
Medusa, 4 GPUs [Zhong & He, IEEE TPDS 2014] → comm. efficiency 9–15%
Totem, 2 GPUs [Gharaibeh et al., PACT 2012] → comm. efficiency 11–18%

Page 2

Our Approach (slide 6)
Warp Segmentation: a variable number of threads is given to each vertex to process its incoming edges → high SIMD efficiency
Vertex Refinement: eliminates unnecessary communication → high communication efficiency

Iterative Graph Processing – Single Source Shortest Path (SSSP) (slide 7)

Figure (slide 7): example graph with vertices V0–V4, V0 the source; each iteration recomputes a vertex from its in-neighbors, e.g. V3 = min( V1 + 3, V2 + 5 ).

Iteration | V0 | V1 | V2 | V3 | V4
0         | 0  | ∞  | ∞  | ∞  | ∞
1         | 0  | 4  | 2  | ∞  | ∞
2         | 0  | 3  | 2  | 7  | 5
3         | 0  | 3  | 2  | 5  | 6
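The iteration above can be sketched as a simple relax-until-fixpoint loop. The edge list below is hypothetical (the slide's exact edge weights are not recoverable from the transcript):

```python
import math

def sssp(num_vertices, edges, source):
    """Iterative SSSP: relax every edge each round until no distance changes.
    `edges` is a list of (src, dst, weight) triples."""
    dist = [math.inf] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        for src, dst, w in edges:
            if dist[src] + w < dist[dst]:
                dist[dst] = dist[src] + w
                changed = True
    return dist

# Hypothetical 5-vertex graph, V0 the source.
edges = [(0, 1, 4), (0, 2, 2), (2, 1, 1), (1, 3, 3), (2, 4, 5), (3, 4, 1), (2, 3, 5)]
print(sssp(5, edges, 0))
```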

Compressed Sparse Row (CSR) Graph Representation (slide 8)
Edges a–h of the example graph are stored contiguously in E (neighbor indices), with PTR marking each vertex's segment:

E:   b c d e f g h a  →  0 4 2 0 1 2 2 4
PTR: 0 1 4 5 7 8
V:   V0 V1 V2 V3 V4  (indices 0–4)
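A minimal sketch of building the PTR/E layout from an edge list. The edge list is hypothetical, chosen only so that PTR comes out with the slide's shape 0 1 4 5 7 8:

```python
def build_csr(num_vertices, edges):
    """Build CSR (PTR, E) from (src, dst) pairs: vertex v's neighbors
    occupy E[PTR[v]:PTR[v+1]]."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        ptr[v + 1] = ptr[v] + counts[v]
    e = [0] * len(edges)
    cursor = ptr[:-1]          # next free slot per vertex
    for src, dst in edges:
        e[cursor[src]] = dst
        cursor[src] += 1
    return ptr, e

# Hypothetical edge list with the same out-degree profile as the slide.
ptr, e = build_csr(5, [(0, 1), (1, 0), (1, 2), (1, 4), (2, 3), (3, 2), (3, 4), (4, 0)])
print(ptr, e)
```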

Existing Works – Static (slide 9)
Each vertex is assigned the same (fixed) number of threads:
[Harish & Narayanan, HiPC 2007] – 1 thread; [Hong et al., PPoPP 2011] – a power-of-2 number of threads; [Kim & Batten, MICRO 2014] – a thread block
→ Leads to SIMD inefficiency

Figure (slide 10): with a static assignment of 4 threads/vertex (PTR = 0 6 8), V0's six edges N0–N5 need two rounds over lanes 0–3 while V1's two edges N6–N7 leave lanes idle – too few / too many threads per vertex.

Warp Segmentation (slide 11)
Figure: a warp is assigned to a vertex group; lanes 0–5 cover V0's edges N0–N5 and lanes 6–7 cover V1's edges N6–N7 (PTR = 0 6 8), so per-segment reductions proceed without idle rounds.

Page 3

Warp Segmentation (slide 12)
Each lane binary-searches its edge item's index in PTR to find the belonging vertex and its position inside that vertex's segment (PTR = 0 6 8: V0 owns N0–N5, V1 owns N6–N7):

Operation                       | N0    N1    N2    N3    N4    N5    | N6    N7
Binary search steps             | [0,8] [0,4] [0,2] [0,1] (N0–N5)     | [0,8] [0,4] [0,2] [1,2]
Belonging vertex index          | 0     0     0     0     0     0     | 1     1
Index inside segment from right | 5     4     3     2     1     0     | 1     0
Index inside segment from left  | 0     1     2     3     4     5     | 0     1
Segment size                    | 6     6     6     6     6     6     | 2     2

Warp Segmentation (slide 13)
For N3: the binary search [0,8] → [0,4] → [0,2] → [0,1] yields belonging vertex 0; index inside segment from left = 3 − 0 = 3; index from right = 6 − 3 − 1 = 2; segment size = 2 + 3 + 1 = 6.
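The per-lane computation above can be sketched on the host side; the slide's PTR = 0 6 8 example is reproduced below:

```python
import bisect

def lane_info(ptr, edge_idx):
    """For one warp lane handling edge item `edge_idx`, binary-search PTR
    to find the owning vertex and the lane's position in that segment."""
    v = bisect.bisect_right(ptr, edge_idx) - 1   # rightmost v with ptr[v] <= edge_idx
    from_left = edge_idx - ptr[v]
    size = ptr[v + 1] - ptr[v]
    from_right = size - from_left - 1
    return v, from_left, from_right, size

ptr = [0, 6, 8]           # V0 owns edges N0..N5, V1 owns N6..N7 (slide example)
print(lane_info(ptr, 3))  # N3 -> vertex 0, from_left 3, from_right 2, size 6
```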

Efficiency of Warp Segmentation (slide 14)
No shared-memory atomics for reduction; no synchronization primitives used; all memory accesses are coalesced except accesses to neighbors' values; exploits instruction-level parallelism.

Experimental Setup (slide 15)

NVIDIA GeForce GTX780. Programs: BFS (Breadth-First Search), CC (Connected Components), NN (Neural Network), PR (PageRank), SSSP (Single Source Shortest Path), SSWP (Single Source Widest Path).
Graphs (#V, #E): RM33V335E (33m, 335m); Orkut (3.07m, 234m); LiveJournal (4.85m, 69m); SocPokec (1.63m, 30.6m); RoadNetCA (1.97m, 5.5m); Amazon0312 (0.40m, 3.2m)

Speedups over VWC (slide 16)
Average per program: BFS 1.27x–2.60x; CC 1.33x–2.90x; NN 1.21x–2.70x; PR 1.22x–2.68x; SSSP 1.31x–2.76x; SSWP 1.28x–2.80x
Average per graph: RM33V335E 1.23x–1.56x; Orkut 1.15x–1.99x; LiveJournal 1.29x–1.99x; SocPokec 1.27x–1.77x; RoadNetCA 1.24x–9.90x; Amazon0312 1.53x–2.68x

Warp Execution Efficiency (slide 17)

Figure: warp execution efficiency (%) of SSSP under VWC-2/4/8/16/32 and Warp Segmentation on RM33V335E, ComOrkut, ER25V201E, RM25V201E, RM16V201E, RM16V134E, LiveJournal, SocPokec, HiggsTwitter, RoadNetCA, WebGoogle, and Amazon0312.

Page 4

Scalability – Multiple GPUs (slide 18)
Figure: example graph V0–V6 to be partitioned across devices. Bandwidth: PCIe ~12 GB/s vs. GPU DRAM ~300 GB/s.

Figure (slide 19): the example graph's CSR arrays (V, PTR, E) partitioned between Device 0 (V0, V1, V2, V5, V6) and Device 1 (V3, V4).

Scalability – Multiple GPUs: Medusa's ALL method [IEEE TPDS 2014] (slide 20)
Figure: the same partitioned graph under ALL → communication efficiency 9–15%.

Figure (slide 21): the same partitioned graph under Totem's Maximal Subset (MS) method [PACT 2012] → communication efficiency 11–18%.

Vertex Refinement (VR) (slide 22)
Offline Vertex Refinement: marks boundary vertices.
Online Vertex Refinement: performed on the fly by the CUDA kernel, looking for updates in values.

Online Vertex Refinement (slide 23)

Threads T#0–T#7 hold V0–V7; V0, V3, and V7 are updated & marked. A binary reduction counts them, and one thread reserves an outbox region:
A = atomicAdd( deviceOutboxMovingIndex, 3 );
A is shuffled to all lanes, and a binary prefix sum over the is-(updated & marked) flags gives each such vertex its slot: O[A+0] = V0, O[A+1] = V3, O[A+2] = V7. Device outbox: V0 V3 V7.
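A minimal host-side sketch of this compaction step, using a plain counter in place of the kernel's atomicAdd and a loop in place of the warp-wide prefix sum:

```python
def compact_to_outbox(values, flags, outbox, moving_index):
    """Warp-style stream compaction: count flagged lanes (binary reduction),
    reserve one contiguous outbox region (the kernel's single atomicAdd),
    then use a prefix sum over the flags to give each flagged lane a slot."""
    count = sum(flags)               # binary reduction across lanes
    base = moving_index[0]           # atomicAdd(deviceOutboxMovingIndex, count)
    moving_index[0] += count
    rank = 0                         # exclusive prefix sum of flags
    for value, flagged in zip(values, flags):
        if flagged:
            outbox[base + rank] = value
            rank += 1
    return outbox

outbox = [None] * 8
moving_index = [0]
# Lanes holding V0..V7; V0, V3, V7 are updated & marked, as on the slide.
flags = [1, 0, 0, 1, 0, 0, 0, 1]
compact_to_outbox([f"V{i}" for i in range(8)], flags, outbox, moving_index)
print(outbox[:3])  # ['V0', 'V3', 'V7']
```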

Page 5

Buffers – Inbox and Outbox; Double Buffering; Host-as-hub (slide 24)
Figure: each GPU (#0–#2) keeps an outbox of (values, indices) alongside its V, PTR, E arrays; outboxes travel over PCIe lanes to per-GPU inboxes (#0–#2) in host memory, with odd and even buffer copies enabling double buffering.

Speedup – VR vs. MS & ALL (slide 25)

Input graph (#V, #E)                       BFS    PR     SSSP
RM54V704E (54m, 704m), 3 GPUs – over ALL: 1.85x  1.48x  1.75x; over MS: 1.82x  1.47x  1.71x
RM41V536E (42m, 537m), 3 GPUs – over ALL: 1.89x  1.44x  1.80x; over MS: 1.82x  1.39x  1.75x
RM41V503E (42m, 503m), 2 GPUs – over ALL: 1.29x  1.18x  1.24x; over MS: 1.27x  1.15x  1.21x
RM35V402E (36m, 403m), 2 GPUs – over ALL: 1.32x  1.21x  1.25x; over MS: 1.30x  1.20x  1.22x

VR vs. MS & ALL (slide 26)
Figure: normalized aggregated processing time (computation vs. communication duration) of ALL, MS, and VR for BFS, PR, and SSSP on 3 GPUs.

Irregular Computations (slide 27)
Graphs on GPUs [PACT 2015a]: Warp Segmentation & Vertex Refinement
[HPDC 2014]: CuSha + CW graph representation
Hashing on GPUs [PACT 2015b]: Stadium Hashing vs. Cuckoo Hashing
Generalization and Automation: [MICRO 2015] Branch Divergence (CCC); [IPDPS 2016] Thread Assignment (CTE)

Graphs on Clusters – Scalability for Large Graphs (slide 28)
Partition the graph across machines; exploit memory and cores across all nodes.
Challenges: Programmability – distributed programs; Performance – network latency is high.

Our Approach (slide 29)
Programmability: ASPIRE Distributed Shared Memory (DSM).
Performance: a caching protocol to tolerate network latency.
On TARDIS (a 16-node cluster with InfiniBand), a remote fetch is 2.3x slower than a local one.
Example update: Fetch(c); Fetch(a); Fetch(b); c' = f(c, a, b); Store(c, c')

Page 6

Relaxed Consistency Protocol (RCP) (slide 30)
Allow the use of stale values to tolerate network latency. Staleness of a value = the number of updates performed since it was cached. Using stale values slows convergence → relax consistency without delaying convergence.

Bounded Staleness [Cui et al., USENIX ATC 2014] (slide 31): controls staleness using a static threshold.
Figure: iterations to convergence vs. number of remote fetches for high and low thresholds.

Relaxed Consistency Protocol (slide 32)
Set a high threshold: fetches are allowed to use stale values → minimize the impact of network latency.
Best-effort refresh: minimize the use of stale values and the staleness of the stale values that are used.

Relaxed Consistency Protocol (slide 33)
Cache-miss: value not in cache.
Current-hit: value in cache; staleness = 0.
Stale-hit: value in cache; 0 < staleness ≤ t; issue a refresh request.
Stale-miss: value in cache; staleness > t.

Relaxed Consistency Protocol – state machine (slides 34–38)
States: Uncached, Shared, Stale. Transitions:
Uncached → Shared: cache-miss / write [local node], staleness = 0
Shared → Shared: hit / write [local node]
Shared → Stale: invalidate [directory], ++staleness
Stale → Stale: stale-hit [local node]; invalidate [directory], ++staleness
Stale → Shared: refresh [local node] or stale-miss [local node], staleness = 0
Shared → Uncached and Stale → Uncached: evict [local node]

Experimental Setup (slide 39)
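The access classification behind the state machine can be sketched as a small lookup function; the cache contents below are hypothetical:

```python
def classify(cache, key, t):
    """Classify a fetch under RCP: cache-miss, current-hit, stale-hit
    (value usable, refresh issued), or stale-miss (too stale, must block).
    `cache` maps key -> staleness (updates seen since the value was cached)."""
    if key not in cache:
        return "cache-miss"
    staleness = cache[key]
    if staleness == 0:
        return "current-hit"
    if staleness <= t:
        return "stale-hit"       # use the stale value, issue a refresh request
    return "stale-miss"          # block and fetch a fresh copy

cache = {"a": 0, "b": 2, "c": 5}   # hypothetical staleness counters
print([classify(cache, k, 3) for k in ("a", "b", "c", "d")])
```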

TARDIS – 16-node cluster.
Programs: SSSP (Single Source Shortest Path), CC (Connected Components), CD (Community Detection), PR (PageRank), GC (Graph Coloring).
Graphs (#V, #E): Orkut (3.07m, 234m); LiveJournal (4.85m, 69m); Pokec (1.63m, 30.6m); HiggsTwitter (0.46m, 14.8m); RoadNetCA (1.97m, 5.5m); RoadNetTX (1.38m, 3.84m)

Execution Time: RCP over SCP (slide 40)
Graph        | CD    CC    GC    PR    SSSP
Orkut        | 1.57x 2.60x 2.14x 1.55x 2.43x
LiveJournal  | 3.71x 3.30x 2.47x 1.83x 2.39x
Pokec        | 2.40x 2.04x 2.26x 6.42x 2.09x
HiggsTwitter | 1.94x 3.34x 2.05x 6.78x 1.25x
RoadNetCA    | 1.35x 1.13x 0.56x 1.02x 1.03x
RoadNetTX    | 0.94x 0.94x 0.88x 0.87x 0.98x
RCP is over 2x faster for graph algorithms.

Remote Fetches (slide 41): RCP blocks on 42% of fetches; the best Stale-n policy blocks on 86%.

Page 8: final USC-2016alchem.usc.edu/ceng-seminar/slides/2016/rajiv_gupta.pdf[MICRO 2015] Branch Divergence (CCC) [IPDPS 2016] Thread Assignment (CTE) 27 Graphs on Clusters Scalability for

9/14/16

8

Iterations for Convergence (slide 42): RCP requires 49% more iterations; Stale-2 and Stale-3 require 146% and 176% more.
Staleness Percentage (slide 43): 97% of the values used have staleness 0; 2% have staleness 1.

RCP compares well with GraphLab and is orthogonal to it. (slide 44)
RCP over GraphLab [VLDB'12]:
Graph        | SSSP  PR    CC
Orkut        | 1.48x 1.01x 1.13x
LiveJournal  | 0.72x 0.86x 3.03x
Pokec        | 0.92x 0.94x 5.71x
HiggsTwitter | 2.20x --    3.95x
RoadNetCA    | 1.22x 11.5x 3.88x
RoadNetTX    | 0.41x 15.5x 2.27x

Vora, Koduru, & Gupta [OOPSLA 2014]: Exploiting Asynchronous Parallelism in Iterative Algorithms using Relaxed Consistency based DSM.
Other Work – Graphs on Clusters (slide 45): Evolving Graphs [TACO 2016]; Streaming Graphs; Confined Recovery.

Out-of-core Graph Processing (slide 46)
GraphChi [OSDI'12] – disk-friendly: the graph is split across multiple shards, created during pre-processing, which remain static throughout processing.
Load and process one shard at a time. Iterative processing is I/O bound – GraphChi spends 73–88% of its time on loading.

Opportunity (slide 47): not all edges are always required.
Figure: % useful edges vs. iteration for PageRank on the LJ, NF, UK, TT, and FT graphs.

Page 9

Challenges (slide 48)
Different algorithms behave differently → need dynamic partitions.
How to create dynamic partitions on the fly? How to process dynamic partitions?
Figure: ideal shard size (normalized) vs. iteration for PR, MSSP, and CC.

Shards – maintain locality while processing (slide 49)
Example graph over vertices a–e, partitioned by destination:
Shard 0: (a,b,e0) (a,c,e1) (d,a,e2) (e,c,e3)
Shard 1: (b,d,e4) (d,e,e5) (e,d,e6)
Load a shard into main memory, compute, and store it back, one shard at a time.

Page 10

Dynamic Shards (slide 50)
Over time, only the still-needed edges are kept:
Shard 0: (a,b,5) (a,c,4) (d,a,7) (e,c,9) → (a,b,4) (a,c,6) (e,c,8) → (a,b,2) (a,c,3)
Shard 1: (b,d,6) (d,e,4) (e,d,1) → (b,d,3) (e,d,6) → (e,d,9)
Eventually only 3 out of 7 edges remain. How to efficiently create dynamic shards on the fly?
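A minimal sketch of shrinking a shard to its still-needed edges. The usefulness criterion here (keep edges whose source changed) is an assumed example, not the paper's exact predicate:

```python
def next_dynamic_shard(shard, useful):
    """Create the next-generation dynamic shard by streaming the current
    shard and keeping only the edges marked useful during processing."""
    return [edge for edge in shard if useful(edge)]

# Slide's Shard 0; suppose only the values of vertices a and e changed.
shard0 = [("a", "b", 5), ("a", "c", 4), ("d", "a", 7), ("e", "c", 9)]
active_sources = {"a", "e"}   # hypothetical set of changed vertices
print(next_dynamic_shard(shard0, lambda edge: edge[0] in active_sources))
```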

Creating Dynamic Shards (slide 51)
While a shard is processed in main memory, record which edges were actually used; when storing the shard back, write out only those edges as the next-generation dynamic shard, e.g. Shard 0: (a,b,5) (a,c,4) (d,a,7) (e,c,9) → (a,b,4) (a,c,6) (e,c,8) → (a,b,2) (a,c,3) → empty.

Page 11

Creation of dynamic shards uses sequential writes and is lightweight. (slide 51)

Processing Dynamic Shards (slide 52)

Dynamic shards are loaded and processed like static ones, but some vertices now have edges missing from the shard. How to process vertices with missing edges?

Delay-based Processing (slide 53)
When a vertex's computation needs edges that are missing from the dynamic shard (e.g. the delayed edge (b,d,5)), delay that computation rather than loading the full shard.

Page 12

Delay-based Processing (slide 54)
Delayed computations are held in an in-memory buffer; delayed vertices are processed periodically in a shadow iteration, in which all edges are made available.
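A minimal sketch of delaying vertices whose edges are missing. The coverage test (compare available in-edges against the full in-degree) and the min-style update are assumed for illustration:

```python
def process_shard(dynamic_shard, full_in_degree, delay_buffer):
    """Process a dynamic shard; a vertex whose available in-edges do not
    cover its full in-degree cannot be finalized and is delayed instead."""
    available = {}
    for src, dst, value in dynamic_shard:
        available.setdefault(dst, []).append((src, value))
    results = {}
    for dst, in_edges in available.items():
        if len(in_edges) < full_in_degree[dst]:
            delay_buffer.add(dst)        # handled later in a shadow iteration
        else:
            results[dst] = min(v for _, v in in_edges)   # e.g. SSSP-style min
    return results

delayed = set()
shard = [("a", "b", 4), ("a", "c", 6), ("e", "c", 8)]   # dynamic shard, (d,a) dropped
in_degree = {"b": 2, "c": 2}                            # hypothetical full in-degrees
print(process_shard(shard, in_degree, delayed), delayed)
```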

Shadow Iteration (slide 55)
In a shadow iteration, the delayed computations (e.g. the buffered edge (b,d,5)) are processed with all of their edges made available, and the results are merged back into memory.

Page 13


Page 14

Dynamic Shards (slide 56)
Static shards S0, S1, …, Sn-1 produce dynamic shards DS0_0 … DS0_n-1; generation i's shards DSi_0 … DSi_n-1 are rebuilt into generation i+1 by a shadow iteration. 64–73% of active vertices get delayed.

Accumulation-based Processing (slides 57–58)
Figure: before, a vertex with any missing in-edge was fully delayed; with accumulation-based processing, contributions from the available edges are accumulated immediately, so a vertex is either not delayed at all or only partially delayed.

Evaluation Setup (slide 59)
Dell 8-core machine with 8 GB main memory; Dell 500 GB 7.2K RPM HDD and Dell 400 GB SSD; Ubuntu 14.04; file-system caches flushed.

Page 15

Benchmarks & Inputs (slide 60)
5 graph algorithms: PageRank, Belief Propagation, Heat Simulation, Multiple Source Shortest Paths, Connected Components. 6 real-world input graphs.

Performance (slide 61)

1.8x average speedup.
Figure: speedup of CC, MSSP, HS, BP, and PR on the UK, TT, and FT inputs.

Dynamic Shard Size (slide 62)

Figure: normalized shard size vs. iteration for PR on FT, HS on FT, and BP on FT.

Reads & Writes (slide 63)
Up to 64% reduction in data read and up to 54% reduction in data written; shadow iterations are I/O intensive.
Figure: normalized read size (regular vs. shadow reads) and write size (regular vs. shadow writes) for PR, BP, and HS on UK, TT, and FT.

Vora, Xu, & Gupta [USENIX ATC 2016]: Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing.
Other Work – Graphs on Multicores (slide 64): Graph Reduction [HPDC 2016]; Out-of-core version.

Page 16

Summary – Graph Processing (slide 65)
On GPUs: SIMD efficiency – keeping threads busy; Communication – efficient use of PCIe bandwidth.
On Clusters: Communication – tolerating network latency.
On Multicores: I/O efficiency – transforming the graph representation.