EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE ACCELERATED TRAVERSAL SCHEDULING Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez MICRO 2018


EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE ACCELERATED TRAVERSAL SCHEDULING

Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez

MICRO 2018


The locality problem of graph processing

• Irregular structure of graphs causes seemingly random memory references
• On-chip caches are too small to fit most real-world graphs
• Software frameworks improve locality through offline preprocessing
• Preprocessing is expensive and often impractical

[Figure: main memory accesses and execution time, Baseline vs. GOrder, for 1 PageRank iteration on the UK web graph. Annotations on the execution-time panel: "Preproc. overhead", "3616".]


Improving locality in an online fashion

• Traversal schedule decides order in which graph edges are processed
• Many real-world graphs have strong community structure
• Traversals that follow community structure have good locality
• Performing this in software without preprocessing is not practical due to scheduling overheads


Contributions

• BDFS: Bounded Depth-First Scheduling
  ◦ Performs a series of bounded depth-first explorations
  ◦ Improves locality for graphs with good community structure
• HATS: Hardware Accelerated Traversal Scheduling
  ◦ A simple unit specialized for traversal scheduling
  ◦ Cheap and implementable in reconfigurable logic

[Figure: PageRank Delta on UK web graph. Panels: main memory accesses, Baseline vs. BDFS (annotation: 1.8x); speedup, Baseline vs. BDFS; speedup for Baseline, BDFS, Baseline-HATS, and BDFS-HATS (annotations: 1.8x, 2.7x).]


Agenda

• Background
• BDFS
• HATS
• Evaluation


Graph data structures

Compressed Sparse Row (CSR) Format

[Figure: example graph with vertices 0, 1, 2, 3]

Graph representation:
  Offset array:   0 2 4 7 9
  Neighbor array: 1 2 0 3 0 1 3 0 2

Algorithm-specific:
  Vertex data:    0.8 7.9 3.6 1.2
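As an illustration of how the CSR arrays are read, the sketch below reuses the slide's example values (offset array 0 2 4 7 9, neighbor array 1 2 0 3 0 1 3 0 2, vertex data 0.8 7.9 3.6 1.2). The helper names `neighbors_of` and `degree` are hypothetical and not taken from the talk.

```python
from typing import List

# Arrays copied from the slide's 4-vertex example.
offsets = [0, 2, 4, 7, 9]                # offsets[v] .. offsets[v + 1] delimit v's edges
neighbors = [1, 2, 0, 3, 0, 1, 3, 0, 2]  # concatenated neighbor lists
vertex_data = [0.8, 7.9, 3.6, 1.2]       # algorithm-specific, one entry per vertex


def neighbors_of(v: int) -> List[int]:
    """Return the neighbor list of vertex v (hypothetical helper)."""
    return neighbors[offsets[v]:offsets[v + 1]]


def degree(v: int) -> int:
    """Number of edges stored for vertex v (hypothetical helper)."""
    return offsets[v + 1] - offsets[v]


for v in range(len(offsets) - 1):
    print(v, neighbors_of(v), [vertex_data[u] for u in neighbors_of(v)])
```

The offset array has one more entry than there are vertices, so the last vertex needs no special case.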


Vertex-ordered (VO) schedule follows layout order

• Simplifies scheduling and parallelism
• Poor locality for vertex data accesses

[Figure: load/store timelines (storage vs. time) under VO. Neighbor array: full spatial locality, no temporal locality. Vertex data (V0 to V3): low spatial locality, low temporal locality.]

[Figure: main memory accesses under VO, broken down into neighbors, offsets, and vertex data; PageRank on UK web graph.]
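A minimal sketch of a vertex-ordered schedule over CSR arrays, assuming the 4-vertex example from the graph data structures slide. The generator name `vertex_ordered_schedule` is hypothetical. It shows why the neighbor array is read sequentially while the vertex data accesses follow whatever ids the neighbor array happens to contain.

```python
from typing import Iterator, List, Tuple

offsets = [0, 2, 4, 7, 9]
neighbors = [1, 2, 0, 3, 0, 1, 3, 0, 2]


def vertex_ordered_schedule(offsets: List[int],
                            neighbors: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield (src, dst) edges in layout order: vertex 0's edges, then vertex 1's, ..."""
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):  # sequential scan of the neighbor array
            yield src, neighbors[i]


# The sequence of vertex-data entries touched is just the neighbor array itself,
# so its locality depends entirely on how vertex ids were assigned.
touched = [dst for _, dst in vertex_ordered_schedule(offsets, neighbors)]
print(touched)
```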


BDFS: Bounded Depth-First Scheduling

• Vertex data accesses have high potential temporal locality
• Following community structure helps harness this locality
• BDFS performs a series of bounded depth-first explorations
• Traversal starts at vertex with id 0
• Processes all edges of first community before moving to second
• Divide-and-conquer nature of BDFS
  ◦ Small depth bounds capture most locality
  ◦ Good locality at all cache levels
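The slides do not include pseudocode, so the following is only a minimal sketch of a bounded depth-first schedule. It assumes one visited flag per vertex and a fixed depth bound, and the names `bdfs_schedule` and `max_depth` are hypothetical. Each vertex is explored once, all of its edges are emitted when it is explored, and an unvisited neighbor is explored immediately if the depth bound allows.

```python
from typing import Iterator, List, Tuple

offsets = [0, 2, 4, 7, 9]
neighbors = [1, 2, 0, 3, 0, 1, 3, 0, 2]


def bdfs_schedule(offsets: List[int], neighbors: List[int],
                  max_depth: int) -> Iterator[Tuple[int, int]]:
    """Yield every (src, dst) edge exactly once, in bounded depth-first order."""
    num_vertices = len(offsets) - 1
    visited = [False] * num_vertices

    def explore(v: int, depth: int) -> Iterator[Tuple[int, int]]:
        visited[v] = True
        for i in range(offsets[v], offsets[v + 1]):
            ngh = neighbors[i]
            yield v, ngh
            # Descend into an unvisited neighbor only while under the depth bound.
            if depth < max_depth and not visited[ngh]:
                yield from explore(ngh, depth + 1)

    # Scan vertices in id order; each unvisited vertex roots a new exploration.
    for root in range(num_vertices):
        if not visited[root]:
            yield from explore(root, 0)


order = list(bdfs_schedule(offsets, neighbors, max_depth=10))
print(order)
# [(0, 1), (1, 0), (1, 3), (3, 0), (3, 2), (2, 0), (2, 1), (2, 3), (0, 2)]
```

With `max_depth=0` this sketch never descends and falls back to the vertex-ordered schedule, which makes the depth bound a simple knob between the two orders.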


BDFS reduces total main memory accesses

[Figure: load/store timelines (storage vs. time), PageRank on UK web graph.
 VO: neighbor array has high spatial locality and no temporal locality; vertex data has low spatial locality and low temporal locality.
 BDFS: neighbor array has lower spatial locality and no temporal locality; vertex data has low spatial locality and high temporal locality.]

[Figure: main memory accesses for VO vs. BDFS, broken down into neighbors, offsets, and vertex data.]
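To make the temporal-locality argument concrete, here is a toy model, not the paper's methodology. It assumes a synthetic graph of four disjoint 4-vertex cliques whose vertex ids are interleaved, a fully associative LRU cache holding four vertex-data entries, and one vertex-data access (the neighbor's entry) per edge. All function names are hypothetical. It ignores cache lines, the offset and neighbor arrays, and everything else a real memory hierarchy does.

```python
from collections import OrderedDict


def build_interleaved_cliques(num_communities, community_size):
    """CSR arrays for disjoint cliques; consecutive ids fall in different communities."""
    n = num_communities * community_size
    offsets, neighbors = [0], []
    for v in range(n):
        members = range(v % num_communities, n, num_communities)
        neighbors.extend(u for u in members if u != v)
        offsets.append(len(neighbors))
    return offsets, neighbors


def vo_order(offsets, neighbors):
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):
            yield src, neighbors[i]


def bdfs_order(offsets, neighbors, max_depth):
    visited = [False] * (len(offsets) - 1)

    def explore(v, depth):
        visited[v] = True
        for i in range(offsets[v], offsets[v + 1]):
            ngh = neighbors[i]
            yield v, ngh
            if depth < max_depth and not visited[ngh]:
                yield from explore(ngh, depth + 1)

    for root in range(len(visited)):
        if not visited[root]:
            yield from explore(root, 0)


def count_misses(schedule, capacity):
    """Misses of an LRU cache of `capacity` vertex-data entries, one access per edge."""
    cache, misses = OrderedDict(), 0
    for _, dst in schedule:
        if dst in cache:
            cache.move_to_end(dst)         # refresh recency on a hit
        else:
            misses += 1
            cache[dst] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return misses


offsets, neighbors = build_interleaved_cliques(num_communities=4, community_size=4)
vo_misses = count_misses(vo_order(offsets, neighbors), capacity=4)
bdfs_misses = count_misses(bdfs_order(offsets, neighbors, max_depth=10), capacity=4)
print(vo_misses, bdfs_misses)  # prints: 48 16
```

In this contrived setting VO hops between communities on every vertex and misses on all 48 accesses, while the bounded depth-first order finishes one clique before starting the next and takes only the 16 compulsory misses. Real graphs are far less clean, so the ratio here says nothing about the measured results.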


BDFS in software does not improve performance

• Scheduling overheads negate the benefits of better locality
• Higher instruction count
• Limited ILP and MLP
  ◦ Interleaved execution of traversal scheduling and edge processing
  ◦ Unpredictable data-dependent branches

[Figure: VO vs. BDFS in software, three panels: memory accesses, instructions, execution time.]


HATS: Hardware Accelerated Traversal Scheduling

• Decouples traversal scheduling from edge processing logic
• Small hardware unit near each core to perform traversal scheduling
• General-purpose core runs algorithm-specific edge processing logic
• HATS is decoupled from the core and runs ahead of it

[Figure: two cores, each with L1 and L2 caches and its own HATS unit, in front of a shared L3 and main memory.]
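One way to picture the decoupling is as a producer/consumer pair. In the software model below, a thread stands in for the HATS unit and a bounded `queue.Queue` stands in for a hardware edge FIFO. The names `scheduler`, `run_core`, and `DONE`, the FIFO depth of 4, and the vertex-ordered traversal are all assumptions made for illustration, not details of the actual design.

```python
import queue
import threading

offsets = [0, 2, 4, 7, 9]
neighbors = [1, 2, 0, 3, 0, 1, 3, 0, 2]

DONE = None  # end-of-traversal sentinel (an assumption of this model)


def scheduler(fifo):
    """Stand-in for the traversal scheduler: walks the graph and enqueues edges."""
    for src in range(len(offsets) - 1):
        for i in range(offsets[src], offsets[src + 1]):
            fifo.put((src, neighbors[i]))  # blocks while the FIFO is full
    fifo.put(DONE)


def run_core(process_edge, fifo_depth=4):
    """Stand-in for the core: only dequeues edges and runs the edge function."""
    fifo = queue.Queue(maxsize=fifo_depth)
    producer = threading.Thread(target=scheduler, args=(fifo,))
    producer.start()
    while True:
        edge = fifo.get()  # the core's only interaction with the scheduler
        if edge is DONE:
            break
        process_edge(*edge)
    producer.join()


in_degree = [0] * (len(offsets) - 1)


def count_in_edges(src, dst):
    in_degree[dst] += 1


run_core(count_in_edges)
print(in_degree)  # prints: [3, 2, 2, 2]
```

The edge function never touches the offset or neighbor arrays, so swapping in a different traversal order would change only `scheduler`.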


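HATS operation and design

[Figure: HATS placed alongside the core, L1, and L2. Blocks inside HATS: "Config." and a FIFO edge buffer. Arrows labeled core accesses, HATS accesses, and prefetches.
 VO-HATS stages: Scan, Fetch Offsets, Fetch Neighbors, Prefetch.
 BDFS-HATS stages: Scan, Fetch Offsets, Fetch Neighbors, Prefetch, plus a stack and an exploration FSM.]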


HATS costs

• Adds only one new instruction
  ◦ Fetches edge from FIFO buffer to core registers
• Very cheap and energy-efficient compared with a general-purpose core
  ◦ RTL synthesis with a 65nm process and 1GHz target frequency

          ASIC                      FPGA
  Area            TDP             Area
  0.4% of core    0.2% of core    3200 LUTs


HATS benefits

• Reduces work for general-purpose core for VO
• Enables sophisticated scheduling like BDFS
• Performs accurate indirect prefetching of vertex data
• Accelerates a wide range of algorithms
• Requires changes to graph framework only, not algorithm code


Evaluation methodology

• Event-driven simulation using zsim
• 16-core processor
  ◦ Haswell-like OOO cores
  ◦ 32 MB L3 cache
  ◦ 4 memory controllers
• IMP [Yu, MICRO’15]
  ◦ Indirect Memory Prefetcher
  ◦ Configured with graph data structure information for accurate prefetching
• 5 applications from Ligra framework
  ◦ PageRank (PR)
  ◦ PageRank Delta (PRD)
  ◦ Connected Components (CC)
  ◦ Radii Estimation (RE)
  ◦ Maximal Independent Set (MIS)
• 5 large real-world graph inputs
  ◦ Millions of vertices
  ◦ Billions of edges


HATS improves performance significantly

• IMP improves performance by hiding latency
• VO-HATS outperforms IMP by offloading traversal scheduling from general-purpose core
• BDFS-HATS gives further gains by reducing memory accesses

[Figure: speedup over VO (%) for IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS; y-axis from 0 to 120.]


HATS reduces both on-chip and off-chip energy

• IMP reduces static energy due to faster execution time
• VO-HATS reduces core energy due to lower instruction count
• BDFS-HATS reduces memory energy due to better locality

[Figure: normalized energy for VO, IMP, VO-HATS, and BDFS-HATS on PR, PRD, CC, RE, and MIS, split into off-chip (memory) and on-chip (core + cache); y-axis from 0.0 to 1.0.]


See paper for more results

• HATS on an on-chip reconfigurable fabric
  ◦ Parallelism enhancements to maintain throughput at a slower clock frequency
• Sensitivity to on-chip location of HATS (L1, L2, LLC)
• Adaptive-HATS
  ◦ Avoids performance loss for graphs with no community structure
• HATS versus other locality optimizations


Conclusion

• Graph processing is bottlenecked by main memory accesses
• BDFS exploits community structure to improve cache locality
• HATS accelerates traversal scheduling to make BDFS practical

Thanks for your attention! Questions are welcome!