software transactional memory for gpu architectures

46
1 Xi’an Jiaotong University, China 2 Beihang University, China 3 University of Florida, USA Yunlong Xu 1 Rui Wang 2 Nilanjan Goswami 3 Tao Li 3 Lan Gao 2 Depei Qian 1, 2 Software Transactional Memory for GPU Architectures CGO, Orlando, USA February 17, 2014

Upload: others

Post on 18-May-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Software Transactional Memory for GPU Architectures

1 Xi’an Jiaotong University, China 2 Beihang University, China 3 University of Florida, USA

Yunlong Xu1 Rui Wang2 Nilanjan Goswami3 Tao Li3 Lan Gao2 Depei Qian1, 2

Software Transactional Memory for GPU Architectures

CGO, Orlando, USA February 17, 2014

Page 2: Software Transactional Memory for GPU Architectures

Motivation General-Purpose computing on Graphics Processing

Units (GPGPU) High compute throughput and efficiency GPU threads usually operate on independent data

GPU + applications with data dependencies between

threads? Current data synchronization approaches on GPUs Atomic read-modify-write operations Locks constructed by using atomic operations

2

Page 3: Software Transactional Memory for GPU Architectures

Background: GPU Locks

3

GPU lock-based synchronization is challenging Conventional problems related to locking 1000s of concurrently executing threads SIMT execution paradigm

Lock schemes on GPUs Pitfalls due to SIMT Spinlock Deadlock Serialization within warp Low utilization Diverging on failure Livelock

Page 4: Software Transactional Memory for GPU Architectures

Background: GPU Locks

4

repeat locked ← CAS(&lock, 0, 1) until locked = 0 critical section... lock ← 0

GPU lock scheme #1: spinlock Pitfall due to SIMT: deadlock

Page 5: Software Transactional Memory for GPU Architectures

Background: GPU Locks

5

lock √ ×

Thread #1

Deadlock Time

Thread #0

Get lock Get

GPU lock scheme #1: spinlock Pitfall due to SIMT: deadlock

Page 6: Software Transactional Memory for GPU Architectures

Background: GPU Locks

6

GPU lock scheme #2: serialization within warp Pitfall due to SIMT: low hardware utilization

for i ← 1 to WARP_SIZE do if (threadIdx.x % WARP_SIZE) = i then repeat locked ← CAS(&lock, 0, 1) until locked = 0 critical section... lock ← 0

Page 7: Software Transactional Memory for GPU Architectures

Background: GPU Locks

7

lock0 √

Thread #1

Unnecessary waiting

Time

Thread #0

Get

Critical section

Release

√ Get

Critical section

Release

lock0

lock1

lock1

GPU lock scheme #2: serialization within warp Pitfall due to SIMT: low hardware utilization

Page 8: Software Transactional Memory for GPU Architectures

Background: GPU Locks

8

done ← false while done = false do if CAS(&lock, 0, 1) = 0 then critical section... lock ← 0 done ← true

GPU lock scheme #3: diverging on failure Pitfall due to SIMT: livelock

Page 9: Software Transactional Memory for GPU Architectures

Background: GPU Locks

9

lock0 √

Thread #1

Time

Thread #0

Get

Release × lock1 Get

lock0 Release lock1 ×

√ lock1 Get

lock0 Get

Livelock

GPU lock scheme #3: diverging on failure Pitfall due to SIMT: livelock

Page 10: Software Transactional Memory for GPU Architectures

Talk Agenda Motivation Background: GPU Locks Transactional Memory (TM) GPU-STM: Software TM for GPUs GPU-STM Code Example GPU-STM Algorithm

Evaluation Conclusion

10

Page 11: Software Transactional Memory for GPU Architectures

... atomic { ... access_shared_data(); ... } ...

Transactional Memory (TM)

Source Code

TM System

Transactions

... atomic { ... access_shared_data(); ... } ...

... atomic { ... access_shared_data(); ... } ...

11

Page 12: Software Transactional Memory for GPU Architectures

Talk Agenda Motivation Background: GPU Locks Transactional Memory (TM) GPU-STM: Software TM for GPUs GPU-STM Code Example GPU-STM Algorithm

Evaluation Conclusion

12

Page 13: Software Transactional Memory for GPU Architectures

GPU-STM Code Example

13

__host__ void host_fun () { STM_STARTUP(); trans_kernel <<<BLOCKS, BLOCK SIZE>>>(); STM_SHUTDOWN(); } __global__ void trans_kernel (){ Warp *warp = STM_NEW_WARP(); TXBegin(warp); ... TXRead(&addr1, warp); TXWrite(&addr2, val, warp); ... TXCommit(warp); STM_FREE_WARP(warp); }

Page 14: Software Transactional Memory for GPU Architectures

GPU-STM Code Example

14

__host__ void host_fun () { STM_STARTUP(); trans_kernel <<<BLOCKS, BLOCK SIZE>>>(); STM_SHUTDOWN(); } __global__ void trans_kernel (){ Warp *warp = STM_NEW_WARP(); TXBegin(warp); ... TXRead(&addr1, warp); TXWrite(&addr2, val, warp); ... TXCommit(warp); STM_FREE_WARP(warp); }

Page 15: Software Transactional Memory for GPU Architectures

Talk Agenda Motivation Background: GPU Locks Transactional Memory (TM) GPU-STM: Software TM for GPUs GPU-STM Code Example GPU-STM Algorithm

Evaluation Conclusion

15

Page 16: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm

16

Val TXRead(Addr addr){ if write_set.find(addr) return write_set(addr) val = *addr validation() lock_log.insert(hash(addr)) return val } TXWrite(Addr addr, Val val){ write_set.update(addr, val) lock_log.insert(hash(addr)) }

void TXCommit () { loop: if !get_locks(lock_log) goto loop validation() update_memory(write_set) release_locks(lock_log) }

A high-level view of GPU-STM algorithm

Page 17: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm

17

Val TXRead(Addr addr){ if write_set.find(addr) return write_set(addr) val = *addr validation() lock_log.insert(hash(addr)) return val } TXWrite(Addr addr, Val val){ write_set.update(addr, val) lock_log.insert(hash(addr)) }

void TXCommit () { loop: if !get_locks(lock_log) goto loop validation() update_memory(write_set) release_locks(lock_log) }

GPU-STM highlight #1: locking algorithm

Page 18: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm

18

Val TXRead(Addr addr){ if write_set.find(addr) return write_set(addr) val = *addr validation() lock_log.insert(hash(addr)) return val } TXWrite(Addr addr, Val val){ write_set.update(addr, val) lock_log.insert(hash(addr)) }

void TXCommit () { loop: if !get_locks(lock_log) goto loop validation() update_memory(write_set) release_locks(lock_log) }

GPU-STM highlight #2: conflict detection algorithm

Page 19: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm: Locking

19

GPU lock scheme #3: diverging on failure

lock0 √

Thread #1

Time

Thread #0

Get

Release × lock1 Get

lock0 Release lock1 ×

√ lock1 Get

lock0 Get

Livelock

Page 20: Software Transactional Memory for GPU Architectures

20

lock0 √ TX #1

Time

TX #0

Get

Commit

× lock1 Get

Get lock0

GPU-STM Algorithm: Locking

Release locks

lock0 Get

Commit

lock1 Get

Release locks

√ √

Solution: sorting before acquiring locks

Page 21: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm: Locking

21

Val TXRead(Addr addr){ if write_set.find(addr) return write_set(addr) val = *addr validation() lock_log.insert(hash(addr)) return val } TXWrite(Addr addr, Val val){ write_set.update(addr, val) lock_log.insert(hash(addr)) }

void TXCommit () { loop: if !get_locks(lock_log) goto loop validation() update_memory(write_set) release_locks(lock_log) }

Encounter-time lock-sorting

Page 22: Software Transactional Memory for GPU Architectures

22

Local lock-log

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0004

0x0003

3 -1

1

2

0 1

2 3

4

5 Locks Indexes

Head

Page 23: Software Transactional Memory for GPU Architectures

23

Encounter-time lock-sorting

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0004

0x0003

3

-1

1

2

0 1

2 3

4

5 Locks Indexes

Local Lock-log

0x0005

Head For each incoming lock:

Page 24: Software Transactional Memory for GPU Architectures

24

Encounter-time lock-sorting

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0003

3

-1

2

0x0005

0 1

2 3

4

5

For each incoming lock: Compare with

existing locks

Locks Indexes

Local Lock-log

Head

0x0004 1

Page 25: Software Transactional Memory for GPU Architectures

25

Encounter-time lock-sorting

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0004

0x0003

0x0005

3

-1

14

2

1

0 1

2 3

4

5 Locks Indexes

Local Lock-log

Head For each incoming lock: Compare with

existing locks Insert into log, and

update indexes

Page 26: Software Transactional Memory for GPU Architectures

26

Encounter-time Lock-sorting

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0004

0x0003

0x0005

3

-1

14

2

1

0 1

2 3

4

5

Average cost: n2/4+ Θ(n) comparisons,

n is num of locks. If n = 64,

n2/4+ Θ(n) > 1024.

Locks Indexes

Local Lock-log

Head

Page 27: Software Transactional Memory for GPU Architectures

27

Hash-table based encounter-time lock-sorting

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0004

0x0003

2

-1

1

-1

0 1

2 3

4

5

Locks Indexes

Hash(lock) = lock % buckets

0

3

Buckets 0x0005

Hash-based Local Lock-log

Page 28: Software Transactional Memory for GPU Architectures

28

Hash-table based encounter-time lock-sorting

GPU-STM Algorithm: Locking

0x0002

0x0008

0x0004

0x0003

0x0005

2

-1

1

-14

-1

0 1

2 3

4

5

Locks Indexes

Hash(lock) = lock % buckets

0

3

Buckets

Hash-based Local Lock-log

Average cost: n2/(4*m)+ Θ(n) comparisons, n is num of locks, m is the num of buckets.

Page 29: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm : Conflict Detection

29

Hierarchical validation Time-based validation + value-based validation

Page 30: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm : Conflict Detection

30

Hierarchical validation Time-based validation false conflict hardware utilization loss due to SIMT

TX #0 TXBegin TXRead(A) TXWrite(C←A+1) TXCommit

TX #1 TXBegin TXWrite(B) TXCommit

Lock Array Memory …

… Lock-bit

sizeof(word)

Version Addr A

Addr C

Addr B False

conflict

Page 31: Software Transactional Memory for GPU Architectures

GPU-STM Algorithm : Conflict Detection

31

Pass ? No

No Conflict Conflict No

Yes

Yes

Pass ?

Time-based validation

Value-based validation

Hierarchical validation Time-based validation + value-based validation

Page 32: Software Transactional Memory for GPU Architectures

Talk Agenda Motivation Background: GPU Locks Transactional Memory (TM) GPU-STM: Software TM for GPUs GPU-STM Code Example GPU-STM Algorithm

Evaluation Conclusion

32

Page 33: Software Transactional Memory for GPU Architectures

33

Implement GPU-STM on top of CUDA runtime Run GPU-STM on a NVIDIA C2070 Fermi GPU Benchmarks 3 STAMP benchmarks + 3 micro-benchmarks

Evaluation

Name Shared Data RD/TX WR/TX TX/Kernel RA 8M 16 16 1M HT 256K 8 8 1M EB 1M-64M 32 32 1M GN 16K/1M 1 1 4M/1M LB 1.75M 352 352 512 KM 2K 32 32 64K

Page 34: Software Transactional Memory for GPU Architectures

34

Performance Results

0

5

10

15

20

25

RA HT GN

Spee

dup

over

CG

L

0 0.5

1 1.5

2 2.5

3

LB KM

VBV

TBV-Sorting

HV-Sorting

HV-Backoff

Optimized

EGPGV

Page 35: Software Transactional Memory for GPU Architectures

35

Performance Results

0

5

10

15

20

25

RA HT GN

Spee

dup

over

CG

L

0 0.5

1 1.5

2 2.5

3

LB KM

HV-Sorting HV-Backoff

Page 36: Software Transactional Memory for GPU Architectures

36

Performance Results

0

5

10

15

20

25

RA HT GN

Spee

dup

over

CG

L

0 0.5

1 1.5

2 2.5

3

LB KM

VBV

TBV-Sorting

HV-Sorting

Page 37: Software Transactional Memory for GPU Architectures

37

Performance Results

0

5

10

15

20

25

RA HT GN

Spee

dup

over

CG

L

0 0.5

1 1.5

2 2.5

3

LB KM

TBV-Sorting

HV-Sorting

Optimized

STM-Optimized: STM-HV-Sorting + STM-TBV-Sorting When amount of shared data < amount of locks, TBV Otherwise, HV

Page 38: Software Transactional Memory for GPU Architectures

38

Performance Results

0

5

10

15

20

25

RA HT GN

Spee

dup

over

CG

L

0 0.5

1 1.5

2 2.5

3

LB KM

Optimized

EGPGV

EGPGV STM [Cederman et al. EGPGV’10]

Page 39: Software Transactional Memory for GPU Architectures

39

Hierarchical Validation vs. Time-based Validation

0 10 20 30 40 50 60 70 80

64 256 1K 4K 16K 64K 256K 1M

Thro

ughp

ut (

TX/m

s)

Threads

TBV-1M TBV-4M TBV-64M HV-1M HV-4M

Amount of shared data: 256MB

Page 40: Software Transactional Memory for GPU Architectures

Talk Agenda Motivation Background: GPU Locks Transactional Memory (TM) GPU-STM: Software TM for GPUs GPU-STM Code Example GPU-STM Algorithm

Evaluation Conclusion

40

Page 41: Software Transactional Memory for GPU Architectures

41

Lock-based synchronization on GPUs is challenging

GPU-STM, a Software TM for GPUs Enables simplified data synchronizations on GPUs Scales to 1000s of TXs Ensures livelock-freedom Runs on commercially available GPUs and runtime Outperforms GPU coarse-grain locks by up to 20x

Conclusion

Page 42: Software Transactional Memory for GPU Architectures

1 Xi’an Jiaotong University, China 2 Beihang University, China 3 University of Florida, USA

Yunlong Xu1 Rui Wang2 Nilanjan Goswami3 Tao Li3 Lan Gao2 Depei Qian1, 2

Software Transactional Memory for GPU Architectures

Page 43: Software Transactional Memory for GPU Architectures

43

Launch Configurations of Workloads

RA HT GN-1, GN-2

LB KM

Thread-blocks

256 256 256, 16 512 64

Threads per Block

256 256 256, 64 256 4

Page 44: Software Transactional Memory for GPU Architectures

44

Execution Time Breakdown

0%

20%

40%

60%

80%

100%

GN-1 GN-2 LB KM

Per

cent

of

Thre

ad

Cyc

les

TX Aborted Committing Locking Consistency Buffering TX Initialization Native Code

Tradeoff: STM overhead vs. scalability enabled

~20x speedup

Page 45: Software Transactional Memory for GPU Architectures

45

Scalability Results

Page 46: Software Transactional Memory for GPU Architectures

Related Work TMs for GPUs A STM for GPUs [Cederman et al. EGPGV’10] KILO TM, a HTM for GPUs [Fung et al. MICRO’11,

MICRO’13]

STMs for CPUs JudoSTM [Olszewski et al. PACT’07] NORec STM [Dalessandro et al. PPoPP’10] TL2 STM [Dice et al. DISC’06] ...

46