TRANSCRIPT
EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE ACCELERATED TRAVERSAL SCHEDULING
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez
MICRO 2018
The locality problem of graph processing

- The irregular structure of graphs causes seemingly random memory references
- On-chip caches are too small to fit most real-world graphs
- Software frameworks improve locality through offline preprocessing
- Preprocessing is expensive and often impractical
[Figure: One PageRank iteration on the UK web graph. GOrder reduces main memory accesses and execution time versus the baseline, but its preprocessing overhead dwarfs the per-iteration savings.]
Improving locality in an online fashion

- The traversal schedule decides the order in which graph edges are processed
- Many real-world graphs have strong community structure
- Traversals that follow community structure have good locality
- Performing this in software without preprocessing is not practical due to scheduling overheads
[Figure: Speedup over Baseline. Software BDFS alone gives no speedup; with hardware support, Baseline-HATS reaches 1.8x and BDFS-HATS 2.7x.]
Contributions

- BDFS: Bounded Depth-First Scheduling
  - Performs a series of bounded depth-first explorations
  - Improves locality for graphs with good community structure
- HATS: Hardware Accelerated Traversal Scheduling
  - A simple unit specialized for traversal scheduling
  - Cheap and implementable in reconfigurable logic
[Figure: Main memory accesses for PageRank Delta on the UK web graph; BDFS reduces them by 1.8x over Baseline.]
Agenda

- Background
- BDFS
- HATS
- Evaluation
Graph data structures

Compressed Sparse Row (CSR) format:

  Offset array:   0 2 4 7 9
  Neighbor array: 1 2 0 3 0 1 3 0 2
  Vertex data:    0.8 7.9 3.6 1.2   (algorithm-specific)

[Figure: Example four-vertex graph (vertices 0-3) alongside its CSR representation.]
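The CSR layout above can be sketched directly in Python. This is a minimal illustration using the slide's example arrays; `out_neighbors` is a hypothetical helper, not code from the paper:

```python
# Compressed Sparse Row (CSR): offsets[v] .. offsets[v+1] delimit the
# slice of the neighbor array that holds vertex v's outgoing edges.
offsets     = [0, 2, 4, 7, 9]               # len = num_vertices + 1
neighbors   = [1, 2, 0, 3, 0, 1, 3, 0, 2]   # concatenated adjacency lists
vertex_data = [0.8, 7.9, 3.6, 1.2]          # algorithm-specific per-vertex state

def out_neighbors(v):
    """Neighbors of vertex v, read contiguously from the neighbor array."""
    return neighbors[offsets[v]:offsets[v + 1]]

print(out_neighbors(2))  # vertex 2 has edges to 0, 1, and 3 -> [0, 1, 3]
```

The offset and neighbor arrays are dense and sequential, which is what makes streaming over edges cheap; the per-vertex data array is where the locality problem lives.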
[Figure: Main memory accesses under the vertex-ordered (VO) schedule for PageRank on the UK web graph, broken down into neighbor array, offset array, and vertex data accesses.]
Vertex-ordered (VO) schedule follows layout order

[Figure: Access patterns over time for PageRank on the UK web graph. Neighbor array loads stream sequentially: full spatial locality, no temporal locality. Vertex data loads/stores are scattered: low spatial locality, low temporal locality.]

- Simplifies scheduling and parallelism
- Poor locality for vertex data accesses
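A vertex-ordered traversal is just a nested loop over the CSR arrays. A sketch (function names are illustrative; real frameworks such as Ligra parallelize this loop):

```python
def vo_traverse(offsets, neighbors, process_edge):
    """Vertex-ordered (VO) schedule: visit vertices in layout order.
    Offset and neighbor reads stream sequentially (full spatial locality),
    but process_edge typically touches per-vertex state indexed by dst,
    and those accesses are effectively random under VO."""
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):
            process_edge(src, neighbors[i])

# Usage on the CSR example from the previous slide:
offsets   = [0, 2, 4, 7, 9]
neighbors = [1, 2, 0, 3, 0, 1, 3, 0, 2]
edges = []
vo_traverse(offsets, neighbors, lambda s, d: edges.append((s, d)))
```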
Agenda

- Background
- BDFS
- HATS
- Evaluation
BDFS: Bounded Depth-First Scheduling

- Vertex data accesses have high potential temporal locality
- Following community structure helps harness this locality
- BDFS performs a series of bounded depth-first explorations
- The traversal starts at the vertex with id 0 and processes all edges of the first community before moving to the second
- Divide-and-conquer nature of BDFS
  - Small depth bounds capture most locality
  - Good locality at all cache levels
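The bounded depth-first idea can be sketched as follows. This is a simplification of the scheme the slide describes, not the paper's exact algorithm: a per-vertex edge cursor guarantees every edge is processed exactly once, and explorations are cut off at a small depth bound:

```python
def bdfs_traverse(offsets, neighbors, process_edge, depth_bound=2):
    """Bounded depth-first scheduling (illustrative sketch).
    Each edge is processed once; descents stop at depth_bound, so each
    exploration stays within a small neighborhood of the root."""
    n = len(offsets) - 1
    cursor = list(offsets[:-1])          # next unprocessed edge of each vertex
    for root in range(n):                # fall back to layout order for roots
        stack = [(root, 0)]              # (vertex, depth)
        while stack:
            v, depth = stack.pop()
            while cursor[v] < offsets[v + 1]:
                dst = neighbors[cursor[v]]
                cursor[v] += 1
                process_edge(v, dst)
                if depth + 1 < depth_bound:
                    stack.append((v, depth))        # resume v later
                    stack.append((dst, depth + 1))  # descend into dst
                    break
```

Because each exploration stays within a small neighborhood, vertices of the same community are touched close together in time, which is where the temporal locality comes from.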
BDFS reduces total main memory accesses

[Figure: PageRank on the UK web graph. Under VO, neighbor array accesses have high spatial but no temporal locality, and vertex data accesses have low spatial and low temporal locality. Under BDFS, neighbor array accesses have somewhat lower spatial locality, but vertex data accesses gain high temporal locality; the breakdown of main memory accesses shows a large net reduction.]
BDFS in software does not improve performance

- Scheduling overheads negate the benefits of better locality
- Higher instruction count
- Limited ILP and MLP
  - Interleaved execution of traversal scheduling and edge processing
  - Unpredictable data-dependent branches

[Figure: VO vs. software BDFS: BDFS reduces main memory accesses but executes many more instructions, so execution time does not improve.]
Agenda

- Background
- BDFS
- HATS
- Evaluation
HATS: Hardware Accelerated Traversal Scheduling

- Decouples traversal scheduling from edge processing logic
- A small hardware unit near each core performs traversal scheduling
- The general-purpose core runs algorithm-specific edge processing logic
- HATS is decoupled from the core and runs ahead of it

[Figure: System overview — a HATS unit sits alongside each core's private L1/L2 caches; cores share the L3 and main memory.]
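In software terms, the decoupling is a producer/consumer split: the scheduler fills a small edge FIFO that the core drains. A toy Python analogy (purely illustrative — the real HATS is a hardware unit that runs ahead of the core, and the buffer size and names here are invented):

```python
from collections import deque

# Toy analogy of the HATS/core split: the "HATS" side walks the CSR
# arrays and fills a small edge FIFO; the "core" side only drains the
# FIFO and runs the algorithm-specific edge function.
EDGE_BUFFER_SIZE = 8

def hats_scheduler(offsets, neighbors):
    """Produces (src, dst) edges in vertex order, like VO-HATS would."""
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):
            yield (src, neighbors[i])

def run(offsets, neighbors, process_edge):
    fifo = deque(maxlen=EDGE_BUFFER_SIZE)  # edge FIFO between HATS and core
    sched = hats_scheduler(offsets, neighbors)
    done = False
    while not done or fifo:
        # "HATS" fills the buffer, running ahead of edge processing.
        while not done and len(fifo) < EDGE_BUFFER_SIZE:
            try:
                fifo.append(next(sched))
            except StopIteration:
                done = True
        # "Core" drains one edge and runs the algorithm-specific logic.
        if fifo:
            process_edge(*fifo.popleft())
```

In hardware the two sides overlap in time, so the core never pays the instruction and branch cost of scheduling.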
HATS operation and design

[Figure: HATS sits next to the L1, is configured by the core, and fills an edge FIFO buffer that the core drains; HATS accesses and prefetches go through the cache hierarchy alongside the core's own accesses. VO-HATS pipelines scan, fetch offsets, fetch neighbors, and prefetch stages. BDFS-HATS adds an exploration FSM and a stack to the same pipeline.]
HATS costs

- Adds only one new instruction
  - Fetches an edge from the FIFO buffer into core registers
- Very cheap and energy-efficient compared with a general-purpose core
  - RTL synthesis with a 65nm process and a 1GHz target frequency

            ASIC area      ASIC TDP       FPGA area
  HATS      0.4% of core   0.2% of core   3200 LUTs
HATS benefits

- Reduces work for the general-purpose core even under the simple VO schedule
- Enables sophisticated scheduling like BDFS
- Performs accurate indirect prefetching of vertex data
- Accelerates a wide range of algorithms
- Requires changes only to the graph framework, not to algorithm code
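The last point is the key software implication: the algorithm supplies only an edge function, and the framework (with HATS) owns the schedule. A hypothetical PageRank-style edge function (illustrative, not Ligra's actual code):

```python
def make_pagerank_edge(contrib, new_rank):
    """Algorithm-specific edge logic: src pushes its contribution to dst.
    The traversal schedule (VO, BDFS, or HATS-driven) is chosen entirely
    by the framework; this function never changes."""
    def process_edge(src, dst):
        new_rank[dst] += contrib[src]
    return process_edge

# Usage: the same edge function works under any edge ordering.
contrib  = [1.0, 2.0, 3.0, 4.0]   # e.g., rank/out-degree per vertex
new_rank = [0.0] * 4
process_edge = make_pagerank_edge(contrib, new_rank)
for s, d in [(0, 1), (0, 2), (1, 0), (1, 3), (2, 0),
             (2, 1), (2, 3), (3, 0), (3, 2)]:
    process_edge(s, d)
```

Because each edge's update is independent, any order that processes every edge once yields the same result, which is what lets HATS reorder edges for locality without touching algorithm code.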
Agenda

- Background
- BDFS
- HATS
- Evaluation
Evaluation methodology

- Event-driven simulation using zsim
- 16-core processor
  - Haswell-like OOO cores
  - 32 MB L3 cache
  - 4 memory controllers
- IMP [Yu, MICRO'15]
  - Indirect Memory Prefetcher
  - Configured with graph data structure information for accurate prefetching
- 5 applications from the Ligra framework
  - PageRank (PR)
  - PageRank Delta (PRD)
  - Connected Components (CC)
  - Radii Estimation (RE)
  - Maximal Independent Set (MIS)
- 5 large real-world graph inputs
  - Millions of vertices
  - Billions of edges
[Figure: Speedup over VO (%) for PR, PRD, CC, RE, and MIS with IMP, VO-HATS, and BDFS-HATS.]
HATS improves performance significantly

- IMP improves performance by hiding latency
- VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
- BDFS-HATS gives further gains by reducing memory accesses
[Figure: Energy normalized to VO for PR, PRD, CC, RE, and MIS, split into off-chip (memory) and on-chip (core + cache) components, for VO, IMP, VO-HATS, and BDFS-HATS.]
HATS reduces both on-chip and off-chip energy

- IMP reduces static energy due to faster execution time
- VO-HATS reduces core energy due to its lower instruction count
- BDFS-HATS reduces memory energy due to better locality
See paper for more results

- HATS on an on-chip reconfigurable fabric
  - Parallelism enhancements maintain throughput at the slower clock frequency
- Sensitivity to the on-chip location of HATS (L1, L2, LLC)
- Adaptive-HATS
  - Avoids performance loss on graphs with no community structure
- HATS versus other locality optimizations
Conclusion

- Graph processing is bottlenecked by main memory accesses
- BDFS exploits community structure to improve cache locality
- HATS accelerates traversal scheduling to make BDFS practical

Thanks for your attention! Questions are welcome!