TRANSCRIPT
EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE ACCELERATED TRAVERSAL SCHEDULING
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez
MICRO 2018
The locality problem of graph processing

- The irregular structure of graphs causes seemingly random memory references
- On-chip caches are too small to fit most real-world graphs
- Software frameworks improve locality through offline preprocessing
- Preprocessing is expensive and often impractical
[Figure: One PageRank iteration on the UK web graph. GOrder reduces main memory accesses and execution time versus the baseline, but its preprocessing overhead dwarfs the per-iteration savings.]
Improving locality in an online fashion

- The traversal schedule decides the order in which graph edges are processed
- Many real-world graphs have strong community structure
- Traversals that follow community structure have good locality
- Performing this in software without preprocessing is not practical due to scheduling overheads
[Figure: Speedup over Baseline. Software BDFS alone gives no speedup; with hardware support, Baseline-HATS reaches 1.8x and BDFS-HATS 2.7x.]
Contributions

- BDFS: Bounded Depth-First Scheduling
  - Performs a series of bounded depth-first explorations
  - Improves locality for graphs with good community structure
- HATS: Hardware Accelerated Traversal Scheduling
  - A simple unit specialized for traversal scheduling
  - Cheap and implementable in reconfigurable logic
[Figure: Main memory accesses for PageRank Delta on the UK web graph; BDFS reduces them by 1.8x over Baseline.]
Agenda

- Background
- BDFS
- HATS
- Evaluation
Graph data structures

Compressed Sparse Row (CSR) format:

  Offset array:   0 2 4 7 9
  Neighbor array: 1 2 0 3 0 1 3 0 2
  Vertex data:    0.8 7.9 3.6 1.2   (algorithm-specific)

[Figure: Example four-vertex graph (vertices 0-3) alongside its CSR representation.]
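The CSR layout above can be sketched directly in Python. This is a minimal illustration using the slide's example arrays; `out_neighbors` is a hypothetical helper, not code from the paper:

```python
# Compressed Sparse Row (CSR): offsets[v] .. offsets[v+1] delimit the
# slice of the neighbor array that holds vertex v's outgoing edges.
offsets     = [0, 2, 4, 7, 9]               # len = num_vertices + 1
neighbors   = [1, 2, 0, 3, 0, 1, 3, 0, 2]   # concatenated adjacency lists
vertex_data = [0.8, 7.9, 3.6, 1.2]          # algorithm-specific per-vertex state

def out_neighbors(v):
    """Neighbors of vertex v, read contiguously from the neighbor array."""
    return neighbors[offsets[v]:offsets[v + 1]]

print(out_neighbors(2))  # vertex 2 has edges to 0, 1, and 3 -> [0, 1, 3]
```

The offset and neighbor arrays are dense and sequential, which is what makes streaming over edges cheap; the per-vertex data array is where the locality problem lives.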
[Figure: Main memory accesses under the vertex-ordered (VO) schedule for PageRank on the UK web graph, broken down into neighbor array, offset array, and vertex data accesses.]
Vertex-ordered (VO) schedule follows layout order

[Figure: Access patterns over time for PageRank on the UK web graph. Neighbor array loads stream sequentially: full spatial locality, no temporal locality. Vertex data loads/stores are scattered: low spatial locality, low temporal locality.]

- Simplifies scheduling and parallelism
- Poor locality for vertex data accesses
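A vertex-ordered traversal is just a nested loop over the CSR arrays. A sketch (function names are illustrative; real frameworks such as Ligra parallelize this loop):

```python
def vo_traverse(offsets, neighbors, process_edge):
    """Vertex-ordered (VO) schedule: visit vertices in layout order.
    Offset and neighbor reads stream sequentially (full spatial locality),
    but process_edge typically touches per-vertex state indexed by dst,
    and those accesses are effectively random under VO."""
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):
            process_edge(src, neighbors[i])

# Usage on the CSR example from the previous slide:
offsets   = [0, 2, 4, 7, 9]
neighbors = [1, 2, 0, 3, 0, 1, 3, 0, 2]
edges = []
vo_traverse(offsets, neighbors, lambda s, d: edges.append((s, d)))
```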
Agenda

- Background
- BDFS
- HATS
- Evaluation
BDFS: Bounded Depth-First Scheduling

- Vertex data accesses have high potential temporal locality
- Following community structure helps harness this locality
- BDFS performs a series of bounded depth-first explorations
- The traversal starts at the vertex with id 0 and processes all edges of the first community before moving to the second
- Divide-and-conquer nature of BDFS
  - Small depth bounds capture most locality
  - Good locality at all cache levels
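The bounded depth-first idea can be sketched as follows. This is a simplification of the scheme the slide describes, not the paper's exact algorithm: a per-vertex edge cursor guarantees every edge is processed exactly once, and explorations are cut off at a small depth bound:

```python
def bdfs_traverse(offsets, neighbors, process_edge, depth_bound=2):
    """Bounded depth-first scheduling (illustrative sketch).
    Each edge is processed once; descents stop at depth_bound, so each
    exploration stays within a small neighborhood of the root."""
    n = len(offsets) - 1
    cursor = list(offsets[:-1])          # next unprocessed edge of each vertex
    for root in range(n):                # fall back to layout order for roots
        stack = [(root, 0)]              # (vertex, depth)
        while stack:
            v, depth = stack.pop()
            while cursor[v] < offsets[v + 1]:
                dst = neighbors[cursor[v]]
                cursor[v] += 1
                process_edge(v, dst)
                if depth + 1 < depth_bound:
                    stack.append((v, depth))        # resume v later
                    stack.append((dst, depth + 1))  # descend into dst
                    break
```

Because each exploration stays within a small neighborhood, vertices of the same community are touched close together in time, which is where the temporal locality comes from.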
BDFS reduces total main memory accesses

[Figure: PageRank on the UK web graph. Under VO, neighbor array accesses have high spatial but no temporal locality, and vertex data accesses have low spatial and low temporal locality. Under BDFS, neighbor array accesses have somewhat lower spatial locality, but vertex data accesses gain high temporal locality; the breakdown of main memory accesses shows a large net reduction.]
BDFS in software does not improve performance

- Scheduling overheads negate the benefits of better locality
- Higher instruction count
- Limited ILP and MLP
  - Interleaved execution of traversal scheduling and edge processing
  - Unpredictable data-dependent branches

[Figure: VO vs. software BDFS: BDFS reduces main memory accesses but executes many more instructions, so execution time does not improve.]
Agenda

- Background
- BDFS
- HATS
- Evaluation
HATS: Hardware Accelerated Traversal Scheduling

- Decouples traversal scheduling from edge processing logic
- A small hardware unit near each core performs traversal scheduling
- The general-purpose core runs algorithm-specific edge processing logic
- HATS is decoupled from the core and runs ahead of it

[Figure: System overview — a HATS unit sits alongside each core's private L1/L2 caches; cores share the L3 and main memory.]
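In software terms, the decoupling is a producer/consumer split: the scheduler fills a small edge FIFO that the core drains. A toy Python analogy (purely illustrative — the real HATS is a hardware unit that runs ahead of the core, and the buffer size and names here are invented):

```python
from collections import deque

# Toy analogy of the HATS/core split: the "HATS" side walks the CSR
# arrays and fills a small edge FIFO; the "core" side only drains the
# FIFO and runs the algorithm-specific edge function.
EDGE_BUFFER_SIZE = 8

def hats_scheduler(offsets, neighbors):
    """Produces (src, dst) edges in vertex order, like VO-HATS would."""
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):
            yield (src, neighbors[i])

def run(offsets, neighbors, process_edge):
    fifo = deque(maxlen=EDGE_BUFFER_SIZE)  # edge FIFO between HATS and core
    sched = hats_scheduler(offsets, neighbors)
    done = False
    while not done or fifo:
        # "HATS" fills the buffer, running ahead of edge processing.
        while not done and len(fifo) < EDGE_BUFFER_SIZE:
            try:
                fifo.append(next(sched))
            except StopIteration:
                done = True
        # "Core" drains one edge and runs the algorithm-specific logic.
        if fifo:
            process_edge(*fifo.popleft())
```

In hardware the two sides overlap in time, so the core never pays the instruction and branch cost of scheduling.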
HATS operation and design

[Figure: HATS sits next to the L1, is configured by the core, and fills an edge FIFO buffer that the core drains; HATS accesses and prefetches go through the cache hierarchy alongside the core's own accesses. VO-HATS pipelines scan, fetch offsets, fetch neighbors, and prefetch stages. BDFS-HATS adds an exploration FSM and a stack to the same pipeline.]
HATS costs

- Adds only one new instruction
  - Fetches an edge from the FIFO buffer into core registers
- Very cheap and energy-efficient compared with a general-purpose core
  - RTL synthesis with a 65nm process and a 1GHz target frequency

            ASIC area      ASIC TDP       FPGA area
  HATS      0.4% of core   0.2% of core   3200 LUTs
HATS benefits

- Reduces work for the general-purpose core even under the simple VO schedule
- Enables sophisticated scheduling like BDFS
- Performs accurate indirect prefetching of vertex data
- Accelerates a wide range of algorithms
- Requires changes only to the graph framework, not to algorithm code
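The last point is the key software implication: the algorithm supplies only an edge function, and the framework (with HATS) owns the schedule. A hypothetical PageRank-style edge function (illustrative, not Ligra's actual code):

```python
def make_pagerank_edge(contrib, new_rank):
    """Algorithm-specific edge logic: src pushes its contribution to dst.
    The traversal schedule (VO, BDFS, or HATS-driven) is chosen entirely
    by the framework; this function never changes."""
    def process_edge(src, dst):
        new_rank[dst] += contrib[src]
    return process_edge

# Usage: the same edge function works under any edge ordering.
contrib  = [1.0, 2.0, 3.0, 4.0]   # e.g., rank/out-degree per vertex
new_rank = [0.0] * 4
process_edge = make_pagerank_edge(contrib, new_rank)
for s, d in [(0, 1), (0, 2), (1, 0), (1, 3), (2, 0),
             (2, 1), (2, 3), (3, 0), (3, 2)]:
    process_edge(s, d)
```

Because each edge's update is independent, any order that processes every edge once yields the same result, which is what lets HATS reorder edges for locality without touching algorithm code.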
Agenda

- Background
- BDFS
- HATS
- Evaluation
Evaluation methodology

- Event-driven simulation using zsim
- 16-core processor
  - Haswell-like OOO cores
  - 32 MB L3 cache
  - 4 memory controllers
- IMP [Yu, MICRO'15]
  - Indirect Memory Prefetcher
  - Configured with graph data structure information for accurate prefetching
- 5 applications from the Ligra framework
  - PageRank (PR)
  - PageRank Delta (PRD)
  - Connected Components (CC)
  - Radii Estimation (RE)
  - Maximal Independent Set (MIS)
- 5 large real-world graph inputs
  - Millions of vertices
  - Billions of edges
[Figure: Speedup over VO (%) for PR, PRD, CC, RE, and MIS with IMP, VO-HATS, and BDFS-HATS.]
HATS improves performance significantly

- IMP improves performance by hiding latency
- VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
- BDFS-HATS gives further gains by reducing memory accesses
[Figure: Energy normalized to VO for PR, PRD, CC, RE, and MIS, split into off-chip (memory) and on-chip (core + cache) components, for VO, IMP, VO-HATS, and BDFS-HATS.]
HATS reduces both on-chip and off-chip energy

- IMP reduces static energy due to faster execution time
- VO-HATS reduces core energy due to its lower instruction count
- BDFS-HATS reduces memory energy due to better locality
See paper for more results

- HATS on an on-chip reconfigurable fabric
  - Parallelism enhancements maintain throughput at the slower clock frequency
- Sensitivity to the on-chip location of HATS (L1, L2, LLC)
- Adaptive-HATS
  - Avoids performance loss on graphs with no community structure
- HATS versus other locality optimizations
Conclusion

- Graph processing is bottlenecked by main memory accesses
- BDFS exploits community structure to improve cache locality
- HATS accelerates traversal scheduling to make BDFS practical

Thanks for your attention! Questions are welcome!