

Page 1: HeteroSync: A Benchmark Suite for Fine-Grained ...rsim.cs.illinois.edu › Talks › 17-iiswc-sinclair-heterosync.pdf · GPU Coherence with DRF • With data-race-free (DRF) memory

HeteroSync: A Benchmark Suite for Fine-Grained Synchronization on Tightly Coupled GPUs

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve

University of Illinois @ Urbana-Champaign

[email protected]

Traditional Heterogeneous SoC Memory Hierarchies

Discrete address spaces

[Figure: SoC block diagram with main memory, interconnect, modem, GPS, DSPs, GPU, A/V HW accelerators, multimedia, vector units, and CPUs with L1 and L2 caches]

Works well for streaming applications

Inefficient for applications with fine-grained synchronization

Motivation

• Tighter CPU-GPU integration – need better synch support
• Lots of heterogeneous coherence, consistency research
– QuickRelease (HPCA ’14)
– HRF (ASPLOS ’14)
– RemoteScopes (ASPLOS ’15)
– DeNovo (MICRO ’15)
– hLRC (MICRO ’16)
– hVIPS (TACO ’16)
– RAts (ISCA ’17)
– …

No standardization – which approach is best?

HeteroSync: new microbenchmark suite

HeteroSync

• Fine-grained synchronization microbenchmarks
– Various mutex, semaphore, barrier algorithms
– Relaxed atomics: event counters, split counters, seqlocks, …
• Enable deep analysis of:
– Algorithm scalability
– Scalability of different coherence and consistency schemes

Standard fine-grained synch microbenchmarks


Outline

• Motivation

• Background: Coherence & Consistency

• HeteroSync

• Results

• Conclusion

Atomics Background

• Default: Data-race-free-0 (DRF0) [Adve ISCA ’90]
– Identify all races as synchronization accesses (C++: atomics)
– All atomics order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

synch (atomic)
// each thread
for i = 0:n
  …
  ADD R4, A[i], R1
  ADD R5, B[i], R1
  …
synch (atomic)
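As a minimal sketch of the DRF0 rule above, the following C++ fragment marks the one racy location as an atomic with the default sequentially consistent ordering, so the ordinary write to `data` is ordered before the consumer's read. The function name `drf0_message_passing` and the producer/consumer structure are illustrative assumptions, not code from the talk or from HeteroSync.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: returns the value the consumer thread observes.
int drf0_message_passing() {
    int data = 0;                    // ordinary (non-atomic) data access
    std::atomic<bool> ready{false};  // synchronization access (seq_cst by default)
    int observed = -1;

    std::thread producer([&] {
        data = 42;          // data write, ordered before the atomic store
        ready.store(true);  // std::memory_order_seq_cst
    });
    std::thread consumer([&] {
        while (!ready.load()) {       // seq_cst load, spins until the store is seen
            std::this_thread::yield();
        }
        observed = data;    // race-free: ordered after the producer's write
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Because the only conflicting accesses go through `ready`, the program has no data race and behaves as if sequentially consistent.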

Atomics Background (Cont.)

• Default: Data-race-free-0 (DRF0) [Adve ISCA ’90]
– All atomics order data
– All atomics order other atomic accesses

Ensures SC semantics if no data races

• Relaxed atomics [Boehm PLDI ’08]
+ Do not order data or other atomics
– But can violate SC, and have no formal specification
• Data-race-free-relaxed (DRFrlx) [Sinclair ISCA ’17]

SC-centric semantics + efficiency
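To illustrate what a relaxed atomic looks like in practice, here is a small C++ sketch of an event counter (events placed into bins) that uses `std::memory_order_relaxed`. The relaxed increments are atomic but impose no ordering on data or on other atomics; the final counts are only read after `join()`. The names `kBins` and `count_events`, and the way events are assigned to bins, are assumptions made for this example.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kBins = 4;  // hypothetical number of bins

// Each thread places events_per_thread events into bins using relaxed increments.
std::array<unsigned, kBins> count_events(unsigned num_threads,
                                         unsigned events_per_thread) {
    std::array<std::atomic<unsigned>, kBins> bins;
    for (auto& b : bins) b.store(0, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&bins, t, events_per_thread] {
            for (unsigned e = 0; e < events_per_thread; ++e) {
                // Relaxed: atomicity only, no ordering of data or other atomics.
                bins[(t + e) % kBins].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join orders all increments before the reads

    std::array<unsigned, kBins> result{};
    for (std::size_t i = 0; i < kBins; ++i) {
        result[i] = bins[i].load(std::memory_order_relaxed);
    }
    return result;
}
```

Replacing `std::memory_order_relaxed` with the default ordering gives the DRF0 version of the same counter; the final counts are identical, only the ordering guarantees (and cost) differ.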

GPU Coherence with DRF

• With data-race-free (DRF) memory model
– No data races; synchs must be explicitly distinguished
– Synchronization accesses (atomics) go to last level cache (LLC)
– Synchronization points are expensive, preclude reuse

[Figure: CPU and GPU caches (lines marked Valid, Dirty) connected through an interconnection n/w to L2 cache banks. Annotations: flush dirty data; invalidate all data.]

// each thread
for i = r[tid]:r[tid+1]
  LOCK
  LD R1, A[i];
  LD R2, B[i];
  R3 = Math(R1, R2);
  ST B[i], R3;
  UNLOCK

Simple but inefficient coherence, simple consistency
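The per-thread loop on this slide can be sketched in C++ as follows. This is a CPU-side illustration under stated assumptions: `SpinLock` is a minimal test-and-set lock standing in for LOCK/UNLOCK, `math_op` is a placeholder because the slide does not say what `Math` computes, and the range `r[tid]:r[tid+1]` is assumed to be half-open.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal test-and-set spin lock standing in for the slide's LOCK/UNLOCK.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Placeholder for Math(R1, R2); the real operation is not given in the slides.
inline int math_op(int a, int b) { return a + b; }

// Thread tid updates indices [r[tid], r[tid+1]) of B under the lock.
void locked_update(const std::vector<int>& A, std::vector<int>& B,
                   const std::vector<std::size_t>& r) {
    SpinLock lock;
    std::vector<std::thread> threads;
    for (std::size_t tid = 0; tid + 1 < r.size(); ++tid) {
        threads.emplace_back([&, tid] {
            for (std::size_t i = r[tid]; i < r[tid + 1]; ++i) {
                lock.lock();             // LOCK
                int r1 = A[i];           // LD R1, A[i]
                int r2 = B[i];           // LD R2, B[i]
                B[i] = math_op(r1, r2);  // ST B[i], Math(R1, R2)
                lock.unlock();           // UNLOCK
            }
        });
    }
    for (auto& t : threads) t.join();
}
```

Every iteration passes through two synchronization points, which is the pattern that makes flush-and-invalidate coherence expensive.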

GPU Coherence with HRF

• New memory model: Heterogeneous-race-free (HRF) [ASPLOS ’14]
– Adds scoped synchronization

[Figure: CPU and GPU caches (lines marked Valid, Dirty) connected through an interconnection n/w to L2 cache banks. Annotations: flush dirty data; invalidate all data.]

// each thread
for i = r[tid]:r[tid+1]
  LOCK       // scope: global
  LD R1, A[i];
  LD R2, B[i];
  R3 = Math(R1, R2);
  ST B[i], R3;
  UNLOCK     // scope: global

GPU Coherence with HRF

• New memory model: Heterogeneous-race-free (HRF)
– Adds scoped synchronization
– No overhead for locally scoped synchronizations
• But higher programming complexity

[Figure: CPU and GPU caches (lines marked Valid, Dirty) connected through an interconnection n/w to L2 cache banks. Annotation: keep data local.]

// each thread
for i = r[tid]:r[tid+1]
  LOCK       // scope: global or local
  LD R1, A[i];
  LD R2, B[i];
  R3 = Math(R1, R2);
  ST B[i], R3;
  UNLOCK     // scope: global or local

More efficient coherence, complex consistency
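The effect of scope on coherence actions can be sketched with a toy cache model in C++. This is a deliberately simplified illustration, not the simulated hardware: `ToyL1`, `Scope`, and `LineState` are hypothetical names, the model tracks only per-line state, and it assumes a globally scoped acquire invalidates valid lines while a globally scoped release flushes dirty lines.

```cpp
#include <cstddef>
#include <vector>

enum class Scope { Local, Global };
enum class LineState { Invalid, Valid, Dirty };

// Toy L1 cache: tracks per-line state only, no data and no timing.
struct ToyL1 {
    std::vector<LineState> lines;
    std::size_t flushes = 0;        // dirty lines written back to L2
    std::size_t invalidations = 0;  // valid lines dropped from the L1

    explicit ToyL1(std::size_t n) : lines(n, LineState::Invalid) {}

    void read(std::size_t i) {
        if (lines[i] == LineState::Invalid) lines[i] = LineState::Valid;
    }
    void write(std::size_t i) { lines[i] = LineState::Dirty; }

    // Acquire (e.g. LOCK): a globally scoped acquire invalidates cached data.
    void acquire(Scope s) {
        if (s == Scope::Local) return;  // locally scoped: keep data local
        for (auto& l : lines) {
            if (l == LineState::Valid) { l = LineState::Invalid; ++invalidations; }
        }
    }
    // Release (e.g. UNLOCK): a globally scoped release flushes dirty data.
    void release(Scope s) {
        if (s == Scope::Local) return;  // locally scoped: nothing to flush
        for (auto& l : lines) {
            if (l == LineState::Dirty) { l = LineState::Valid; ++flushes; }
        }
    }
};
```

Running the same read, write, release, acquire sequence with `Scope::Global` and `Scope::Local` shows the trade-off on the slide: the local version performs no flushes or invalidations, but the programmer must choose the scope correctly.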

DeNovo Coherence with DRF

• Reuse dirty data across synch points – more data reuse
• Synchronization accesses can be performed at L1 – synch reuse

[Figure: CPU and GPU caches (line states Valid, Dirty, Own) connected through an interconnection n/w to L2 cache banks. Annotations: obtain ownership; invalidate non-owned data. Labels "Dir", "Reg", and entry "O, GPU": only track 1 up-to-date copy.]

for i = r[tid]:r[tid+1]
  LOCK
  LD R1, A[i];
  LD R2, B[i];
  R3 = Math(R1, R2);
  ST B[i], R3;
  UNLOCK

Efficient coherence, simple consistency
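A toy model in the same spirit can illustrate the two DeNovo bullets above. It is a sketch under assumptions, not the DeNovo protocol itself: `ToyDeNovoL1` and `DnState` are hypothetical names, a write is modeled as obtaining ownership of the line, an acquire invalidates only non-owned lines, and a release is modeled as having nothing to flush.

```cpp
#include <cstddef>
#include <vector>

enum class DnState { Invalid, Valid, Owned };

// Toy DeNovo-style L1: written lines become Owned and survive synch points.
struct ToyDeNovoL1 {
    std::vector<DnState> lines;
    std::size_t ownership_requests = 0;  // requests sent to obtain ownership
    std::size_t invalidations = 0;       // non-owned lines dropped at acquires

    explicit ToyDeNovoL1(std::size_t n) : lines(n, DnState::Invalid) {}

    void read(std::size_t i) {
        if (lines[i] == DnState::Invalid) lines[i] = DnState::Valid;
    }
    void write(std::size_t i) {
        if (lines[i] != DnState::Owned) {  // first write obtains ownership
            ++ownership_requests;
            lines[i] = DnState::Owned;
        }
    }
    // Acquire: invalidate non-owned data only; owned lines are kept.
    void acquire() {
        for (auto& l : lines) {
            if (l == DnState::Valid) { l = DnState::Invalid; ++invalidations; }
        }
    }
    // Release: modeled as a no-op here, since owned data stays in the L1.
    void release() {}
};
```

In this model, a line written before a synchronization point is still owned afterwards, so a second write to it needs no new request, which is the reuse the slide refers to.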


Outline

• Motivation

• Background: Coherence & Consistency

• HeteroSync

– Synchronization Primitives Microbenchmarks

– Relaxed Atomics Microbenchmarks

• Results

• Conclusion

Synchronization Primitives Microbenchmarks

• SyncPrims microbenchmarks [Stuart CoRR ’11]:
– Originally studied synchronization primitive latency
– Focus: performance of atomic operations
– Less focus: overheads of proper synchronization
• No global data accesses
• Microbenchmarks:
– Mutexes: Spin (with backoff), centralized ticket, ring buffer
– Semaphores: Spin (with backoff)
– Barriers: Centralized, decentralized barriers
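For readers unfamiliar with the centralized ticket mutex named above, here is a minimal C++ sketch of the general technique: each thread takes a ticket and spins until a shared "now serving" counter reaches it. This is a textbook formulation on CPU threads, not the HeteroSync or SyncPrims GPU code; `TicketLock` and `ticket_lock_count` are illustrative names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Centralized ticket lock: FIFO ordering through two shared counters.
class TicketLock {
public:
    void lock() {
        // Take the next ticket; the acquire below provides the ordering.
        const unsigned my = next_ticket_.fetch_add(1, std::memory_order_relaxed);
        while (now_serving_.load(std::memory_order_acquire) != my) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Only the lock holder writes now_serving_, so load-then-store is safe.
        const unsigned next = now_serving_.load(std::memory_order_relaxed) + 1;
        now_serving_.store(next, std::memory_order_release);
    }

private:
    std::atomic<unsigned> next_ticket_{0};
    std::atomic<unsigned> now_serving_{0};
};

// Hypothetical driver: every thread increments a plain counter under the lock.
long ticket_lock_count(unsigned num_threads, unsigned iters) {
    TicketLock lock;
    long counter = 0;  // ordinary data, protected by the lock
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&lock, &counter, iters] {
            for (unsigned i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

All waiters spin on the same `now_serving_` location, which is why contention on a centralized ticket lock grows with the number of participants.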

Synchronization Primitives Microbenchmarks

• Updates [Sinclair MICRO ’15]:
– Global data accesses in critical sections
– Synchronization loads and stores to enforce ordering
– Two versions of each microbenchmark: local/global scope
– Optimize algorithms
• Microbenchmarks:
– Mutexes: Spin (with backoff), centralized ticket, decentralized ticket, ring buffer
– Semaphores: Spin (with backoff)
– Barriers: Centralized, decentralized barriers (2-level tree + local exchange)

Can vary data size, scope, synchronization primitive
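As an illustration of the centralized barrier named in the list above, the following C++ sketch uses the standard sense-reversing scheme: the last thread to arrive resets the count and flips a shared sense flag that the others spin on. It is a generic CPU-thread formulation rather than the suite's GPU implementation; `CentralBarrier` and `barrier_orders_phases` are hypothetical names.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sense-reversing centralized barrier for a fixed number of threads.
class CentralBarrier {
public:
    explicit CentralBarrier(unsigned n) : n_(n), count_(n), sense_(false) {}

    void wait() {
        const bool my_sense = !sense_.load(std::memory_order_relaxed);
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            count_.store(n_, std::memory_order_relaxed);        // reset for reuse
            sense_.store(my_sense, std::memory_order_release);  // release waiters
        } else {
            while (sense_.load(std::memory_order_acquire) != my_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const unsigned n_;
    std::atomic<unsigned> count_;
    std::atomic<bool> sense_;
};

// Hypothetical check: writes before the barrier are visible after it.
bool barrier_orders_phases(unsigned num_threads) {
    CentralBarrier barrier(num_threads);
    std::vector<int> slot(num_threads, 0);
    std::vector<int> seen(num_threads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            slot[t] = static_cast<int>(t) + 1;     // phase 1: write own slot
            barrier.wait();
            seen[t] = slot[(t + 1) % num_threads]; // phase 2: read neighbor
        });
    }
    for (auto& th : threads) th.join();
    for (unsigned t = 0; t < num_threads; ++t) {
        if (seen[t] != static_cast<int>((t + 1) % num_threads) + 1) return false;
    }
    return true;
}
```

A decentralized (tree) barrier applies the same idea hierarchically so that fewer threads contend on each counter.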


Outline

• Motivation

• Background: Coherence & Consistency

• HeteroSync

– Synchronization Primitives Microbenchmarks

– Relaxed Atomics Microbenchmarks

• Results

• Conclusion

Relaxed Atomic (RAts) Microbenchmarks

• Contacted vendors, developers, and researchers
– Common uses of relaxed atomics [Sinclair ISCA ’17]:

Event Counters: Place events into bins
Flags: Shared flag for inter-thread communication
Split Counters: Simultaneously update and get partial sums
Seqlocks: Sequence number instead of mutex lock
Ref Counters: Track threads using an object; delete if none

Can vary data size, algorithm
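The split counter use case ("simultaneously update and get partial sums") can be sketched in C++ as follows. Each writer updates its own slot with a relaxed increment while a reader sums the slots without stopping the writers, so a read may return a partial sum. `SplitCounter` and `run_split_counter` are names invented for this sketch, and the one-slot-per-thread layout is an assumption.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Split counter: one relaxed atomic slot per writer.
class SplitCounter {
public:
    explicit SplitCounter(std::size_t parts) : parts_(parts) {
        for (auto& p : parts_) p.store(0, std::memory_order_relaxed);
    }
    void add(std::size_t part, unsigned long v) {
        parts_[part].fetch_add(v, std::memory_order_relaxed);
    }
    // May observe a partial sum while writers are still running.
    unsigned long read() const {
        unsigned long sum = 0;
        for (const auto& p : parts_) sum += p.load(std::memory_order_relaxed);
        return sum;
    }

private:
    std::vector<std::atomic<unsigned long>> parts_;
};

// Hypothetical driver: returns {a concurrent partial sum, the final total}.
std::pair<unsigned long, unsigned long> run_split_counter(unsigned num_threads,
                                                          unsigned iters) {
    SplitCounter counter(num_threads);
    unsigned long partial = 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, t, iters] {
            for (unsigned i = 0; i < iters; ++i) counter.add(t, 1);
        });
    }
    std::thread reader([&] { partial = counter.read(); });
    reader.join();
    for (auto& w : workers) w.join();
    return {partial, counter.read()};
}
```

Because the slots only grow, any concurrent read is bounded by the final total, which is what makes relaxed ordering acceptable for this pattern.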


Outline

• Motivation

• Background: Coherence & Consistency

• HeteroSync

• Results

• Conclusion


Evaluation Methodology


• 1 CPU core + 1-15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

• Simulation Environment

– GEMS, Simics, Garnet, GPGPU-Sim

• HeteroSync microbenchmarks

– SyncPrims: weak scaling

– Relaxed Atomics: strong scaling
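The difference between the two scaling modes can be made concrete with a short C++ sketch using the usual definitions: under weak scaling the work per CU is fixed and the total grows with the CU count, while under strong scaling the total is fixed and each CU's share shrinks. The `Work` struct, the function names, and the example sizes are assumptions; the slides do not give the actual problem sizes.

```cpp
#include <cstddef>

// Hypothetical description of a workload split across compute units (CUs).
struct Work {
    std::size_t total;   // total work items
    std::size_t per_cu;  // work items per CU
};

// Weak scaling: per-CU work is fixed, total work grows with the CU count.
constexpr Work weak_scaling(std::size_t per_cu, std::size_t num_cus) {
    return {per_cu * num_cus, per_cu};
}

// Strong scaling: total work is fixed, per-CU work shrinks with the CU count.
constexpr Work strong_scaling(std::size_t total, std::size_t num_cus) {
    return {total, total / num_cus};
}
```

This is why, in the results, SyncPrims execution time tends to rise with more CUs (more total work and contention) while relaxed atomics execution time tends to fall (the same work spread across more CUs).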

Configurations Studied

Coherence   Consistency   Abbreviation   Microbenchmarks
GPU         DRF0          GD (GD0)       SyncPrims, Relaxed Atomics
GPU         HRF           GH             SyncPrims
DeNovo      DRF0          DD (DD0)       SyncPrims, Relaxed Atomics
GPU         DRFrlx        GDR            Relaxed Atomics
DeNovo      DRFrlx        DDR            Relaxed Atomics

Studied GPU, DeNovo coherence with DRF0, DRFrlx, HRF

Key Evaluation Questions

• How are coherence/consistency schemes impacted?
– Do certain algorithms scale better than others?
– How does an algorithm scale with local/global scope?
– Do relaxed atomics impact scalability?

Centralized Ticket Lock Scalability (Global Scope)

[Figure: cycles (M), 0 to 15, vs. # CU (1, 2, 4, 8, 15). Series: GH (= GD), DD]

As CUs increase, execution time increases due to increased contention

DeNovo+DRF is able to reuse synch, so scales 20% better than GPU+HRF

Centralized vs. Decentralized (Global Scope)

[Figure: two plots of cycles (M), 0 to 15, vs. # CU (1, 2, 4, 8, 15). Series in each: GH (= GD), DD]

As CUs increase, execution time increases due to increased contention

Decentralized ticket lock scales better than centralized with DeNovo+DRF

For decentralized, DeNovo+DRF scales 32% better than GPU+HRF

Key Evaluation Questions

• How are coherence/consistency schemes impacted?
– Do certain algorithms scale better than others?
  Coherence protocol impacts which algorithm scales better
– How does an algorithm scale with local/global scope?
– Do relaxed atomics impact scalability?

Decentralized Ticket Lock Scalability (Local Scope)

[Figure: cycles (K), 0 to 1200, vs. # CU (1, 2, 4, 8, 15). Series: GD, DD]

GPU+DRF cannot perform atomics locally, contention increases with # CUs

DeNovo+DRF exploits locality

Decentralized Ticket Lock Scalability (Local Scope)

[Figure: cycles (K), 0 to 1200, vs. # CU (1, 2, 4, 8, 15). Series: GD, GH, DD]

GPU+DRF cannot perform atomics locally, contention increases with # CUs

DeNovo+DRF exploits locality

GPU+HRF also exploits locality, but with increased programming complexity

Key Evaluation Questions

• How are coherence/consistency schemes impacted?
– Do certain algorithms scale better than others?
  Coherence protocol impacts which algorithm scales better
– How does an algorithm scale with local/global scope?
  DeNovo+DRF provides best scalability with global scope
  GPU+HRF and DeNovo+DRF both scale well with local scope
– Do relaxed atomics impact scalability?

Split Counters Scalability

[Figure: cycles (K), 0 to 400, vs. # CU (1, 2, 4, 8, 15). Series: GD0, DD0]

Execution time significantly reduced as work spread across more CUs

DeNovo+DRF0: tradeoff between increased reuse, remote accesses

Split Counters Scalability

[Figure: cycles (K), 0 to 400, vs. # CU (1, 2, 4, 8, 15). Series: GD0, DD0, GDR, DDR]

Execution time significantly reduced as work spread across more CUs

DeNovo+DRF0: tradeoff between increased reuse, remote accesses

Relaxed atomics reduce execution time compared to DRF0

Key Evaluation Questions

• How are coherence/consistency schemes impacted?
– Do certain algorithms scale better than others?
  Coherence protocol impacts which algorithm scales better
– How does an algorithm scale with local/global scope?
  DeNovo+DRF provides best scalability with global scope
  GPU+HRF and DeNovo+DRF both scale well with local scope
– Do relaxed atomics impact scalability?
  Relaxed atomics reduce execution time, but increase contention

Compare schemes and scalability with HeteroSync

Summary

• HeteroSync: fine-grained GPU synch microbenchmarks
– Synchronization primitives: mutexes, semaphores, barriers
– Relaxed atomics: event counters, split counters, seqlocks, …
– Highly configurable
• Study algorithms, coherence, and consistency
– Examine scalability of existing approaches
• Standard set of GPU microbenchmarks
– Released soon: github.com/mattsinc/heterosync