fast triangle counting on the gpuโฌยฆย ยท computational approaches enumerating over all...
TRANSCRIPT
![Page 1: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/1.jpg)
Fast Triangle Counting
on the GPU
Oded Green
![Page 2: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/2.jpg)
Triangle Counting
Building block for clustering coefficients
Defined by [Watts & Strogatz; 1998]
Used to state how tightly bound vertices are in a network
Used for the analysis many types of networks:
Communication
Social
Biological
[email protected], GTC, 2015 Twitter social network using Large
Graph Layout
![Page 3: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/3.jpg)
Relevant network properties
Sparse-network
System utilization
Power-law distribution
[Faloutsos & Faloutsos & Faloutsos; 1999]
[Broder et al.; 2000]
Load balancing issues
[email protected], GTC, 2015
![Page 4: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/4.jpg)
Computational Approaches
Enumerating over all node-triples - ๐(๐3).
Using matrix multiplication - ๐(๐๐ค), ๐ค โค 2.376.
Adjacency list intersection - ๐ ๐ โ ๐๐๐๐ฅ2 ,
where ๐ ๐๐๐ฅ is the vertex with largest
adjacency.
[email protected], GTC, 2015
![Page 5: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/5.jpg)
For given e = ๐ข, ๐ฃ โ ๐ธ:
Take:
uโs list
vโs list
Intersection and find common neighbors
Then intersect (๐ข, ๐ฅ) and (๐ข, ๐ฆ)
Adjacency List Intersection
x
v u
y
[email protected], GTC, 2015
![Page 6: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/6.jpg)
GPU Challenges: List Intersection
Partitioning / Load balancing
Must be computationally cheap
Must be parallel (high utilization)
Minimal synchronization/communication
Scalable
Simple
[email protected], GTC, 2015
![Page 7: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/7.jpg)
The only other GPU triangle counting algorithm
Uses the GPU like a CPU
One CUDA thread per
intersection
Load balancing issues
Low utilization
Limited scalability
[Heist et al. ;2012]
[email protected], GTC, 2015
![Page 9: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/9.jpg)
Merge-Path
Visual approach for merging
Highly scalable
Load-balanced
Two legal moves
Right
Down
Reminder: Merging and intersecting are similar
[email protected], GTC, 2015 Merge Path : [Odeh et al. ; 2012]
GPU Merge Path : [Green et al. ;2012]
B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]
1 2 4 5 8 9 12 13
A[1] 4
A[2] 6
A[3] 7
A[4] 11
A[5] 12
A[6] 14
A[7] 15
A[8] 16
![Page 10: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/10.jpg)
Partitioning
Find the intersection of the
cross-diagonal and the
path
Uses binary search
[email protected], GTC, 2015
B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]
1 2 4 5 8 9 12 13
A[1] 4
A[2] 6
A[3] 7
A[4] 11
A[5] 12
A[6] 14
A[7] 15
A[8] 16
![Page 11: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/11.jpg)
Merge-Path Phases
Partitioning
Log(n)
Synchronization
Merging
Synchronization
[email protected], GTC, 2015
![Page 12: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/12.jpg)
Intersect-Path
Building block for triangle counting and clustering coefficients
Modified Merge-Path
Right
Down
Right-down โ when values are equal
[email protected], GTC, 2015
B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]
1 2 4 5 8 9 12 13
A[1] 4
A[2] 6
A[3] 7
A[4] 11
A[5] 12
A[6] 14
A[7] 15
A[8] 16
![Page 13: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/13.jpg)
B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]
1 2 4 5 8 9 12 13
A[1] 4
A[2] 6
A[3] 7
A[4] 11
A[5] 12
A[6] 14
A[7] 15
A[8] 16
Differences
Typically unique values in each set
Two possible start locations when cross-diagonal goes through โequal-pathโ.
Requires:
Checking the initial condition
Avoiding double counting
Shared-memory and synchronization
[email protected], GTC, 2015
![Page 14: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/14.jpg)
Triangle-Counting
Split vertices to thread blocks
|V|/Thread_Blocks
Iteratively (over the vertices)
Partition
Log(n)
Warp Synchronization
Multiple Intersection
Thread-block Synchronization
Vertices
Thread-block
Warp 0 Warp 1
[email protected], GTC, 2015
![Page 15: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/15.jpg)
Algorithm Control Parameters
๐ผ: # intersection in a thread block
Small number : under utilization
Large number : not useful for sparse networks
๐โ: # threads per intersection
Small number : load-balancing, under utilization
Large number : adds overhead compared to the intersection
๐ผ โ ๐โ โฅ ๐ค๐๐๐ ๐ ๐๐ง๐
[email protected], GTC, 2015
![Page 17: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/17.jpg)
Experiments
Graphs taken from 10th DIMACS Challenge
GPU
NVIDIA K40
12 GB
14 MPs x 192 SPs (per MP) = 2880 CUDA threads
[email protected], GTC, 2015
![Page 18: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/18.jpg)
Experiments
CPU - OpenMP based [Green et al. ;2014]
4x10 Intel Xeon E7-8870 processers
256 GB
30 MB L3 cache per processor
[email protected], GTC, 2015
![Page 19: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/19.jpg)
Implementation comparison
GPU algorithm as discussed here
CPU algorithms
Threads are independent
Partitioning stage not needed
[email protected], GTC, 2015
![Page 20: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/20.jpg)
GPU Vs. Sequential CPU
[email protected], GTC, 2015
0
5
10
15
20
25
30
35
40
Spe
edu
p
GPU IntersectPath Sequential (Baseline)
![Page 21: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/21.jpg)
GPU Vs. CPU โ OpenMP Straightforward
OpenMP results from[Green et al. ;2014] [email protected], GTC, 2015
0
5
10
15
20
25
30
35
40
Spe
edu
p
Naรฏve OMP GPU IntersectPath Sequential (Baseline) Maximal for 40 cores
![Page 22: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/22.jpg)
GPU Vs. CPU OpenMP Optimized I
[email protected], GTC, 2015 [Green et al. ;2014]
0
5
10
15
20
25
30
35
40
Spe
edu
p
Optimized OMP I GPU IntersectPath Sequential (Baseline) Maximal for 40 cores
![Page 23: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/23.jpg)
[email protected], GTC, 2015
![Page 24: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/24.jpg)
GPU Vs. CPU OpenMP Optimized II
[email protected], GTC, 2015 [Green et al. ;2014]
0
5
10
15
20
25
30
35
40
Spe
edu
p
Optimized OMP II GPU IntersectPath Sequential (Baseline) Maximal for 40 cores
![Page 25: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/25.jpg)
GPU โ Parameter control -prefAttachment
[email protected], GTC, 2015
Single thread per intersection
X-Y-Z configuration:
โข X: threads per
block
โข Y: thread per
intersection
โข Z: number of
intersection per
thread block
![Page 26: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/26.jpg)
[email protected], GTC, 2015
Single Intersection per warp
GPU โ Parameter control -prefAttachment
X-Y-Z configuration:
โข X: threads per
block
โข Y: thread per
intersection
โข Z: number of
intersection per
thread block
![Page 27: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/27.jpg)
These parameters count
3๐ โ 7๐ difference between slowest and
fastest
Offer better system utilization
Offer better load-balancing
[email protected], GTC, 2015
![Page 28: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/28.jpg)
Final thoughts
Intersect-Path
Modified Merge-Path
Performance
9X-32X faster than a sequential
[email protected], GTC, 2015
![Page 29: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/29.jpg)
Additional Challenges (Future work)
Fine-tuning the GPU parameters based on
graph properties.
More load-balancing across the MPs.
Create the capability to analyze large
graphs on the GPU
Limited memory size
[email protected], GTC, 2015
![Page 30: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/30.jpg)
Thank you!
Join our Open-Source community
https://github.com/arrayfire/arrayfire
[email protected], GTC, 2015
![Page 32: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/32.jpg)
Examples โ Clustering Coefficient
๐ถ๐ถ = ๐ถ๐ถ๐ฃ =
๐ฃ
๐ก๐๐(๐ฃ)
deg ๐ฃ โ deg ๐ฃ โ 1๐ฃ
[email protected], GTC, 2015
![Page 33: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/33.jpg)
Immense volume of data
Facebook: >1B users
average 130 friends
30B pieces of content shared / month
Twitter: 200M active users
400M tweets / day
Goal: Use the interaction
network to understand
and characterize
information flow
Motivation: Social Media
Sources: Facebook, Twitter
[email protected], GTC, 2015
![Page 34: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/34.jpg)
GPU Vs. CPU
[email protected], GTC, 2015
0
5
10
15
20
25
30
35
40
Spee
du
p
Naรฏve OMP Optimized OMP I Optimized OMP II
GPU IntersectPath Sequential (Baseline) Maximal for 40 cores
![Page 35: Fast Triangle Counting on the GPUโฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐(๐3). Using matrix multiplication - ๐(๐ ), โค t. u y x. Adjacency](https://reader035.vdocuments.us/reader035/viewer/2022062605/5fd2d277b088523a1174f6cb/html5/thumbnails/35.jpg)
Speedup/Dollar
[email protected], GTC, 2015
0
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.008
Spee
du
p/d
olla
r
Naรฏve OMP Optimized OMP I Optimized OMP II GPU IntersectPath
Assuming
CPU : $20K
GPU : $5K