
Page 1:

LLNL-PRES-666776 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

ICPE 2015

Thomas R. W. Scogland, Wu-chun Feng

February 2nd, 2015

Page 2:

Lawrence Livermore National Laboratory LLNL-PRES-666776 3

Why another concurrent queue?

Page 3:

Lawrence Livermore National Laboratory LLNL-PRES-666776 4

Heterogeneity and many-core are a fact of life in modern computing

Page 4:

Lawrence Livermore National Laboratory LLNL-PRES-666776 5

Everything from cell phones

By Zach Vega (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Page 5:

Lawrence Livermore National Laboratory LLNL-PRES-666776 6

To supercomputers

Image Courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy

Page 6:

Lawrence Livermore National Laboratory LLNL-PRES-666776 7

Why not existing lock-free queues?

!  Traditional lock-free queues focus on progress over throughput

!  Perfect for over-subscribed systems, but they do not scale

7"

0

500

1000

1500

2000

2500

3 3 6 6 9 9 12 12 15 15 18 18 21 21 24 24 27 27 30 30

Ope

ratio

ns p

er m

illis

econ

d

Independent threads

Four Opteron 6134 CPUs

0

500

1000

1500

2000

2500

11 83 155 227 299 371 443 515 587 659 731 803

Ope

ratio

ns p

er m

illis

econ

d

Independent threads

One NVIDIA K20c GPU

Page 7:

Lawrence Livermore National Laboratory LLNL-PRES-666776 8

!  Definitions and abstractions
!  Building blocks: Evaluating atomic operations
!  Queue types and modeling
!  Our queue design
!  Performance evaluation
!  Conclusions

Outline

Page 8:

Lawrence Livermore National Laboratory LLNL-PRES-666776 9

!  Work-item: The basic unit of work in OpenCL
•  Groups of work-items execute in lock-step
•  Work-items are not threads

!  Thread: An independently schedulable entity
•  An OS thread on CPUs
•  In OpenCL, defined as a group of work-items of size “PREFERRED_WORK_GROUP_SIZE_MULTIPLE”

Definitions: What is a “thread”?
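For reference, this per-kernel "thread" width can be queried from the OpenCL 1.1 host API. A minimal sketch, assuming a kernel and device have already been created:

#include <stdio.h>
#include <CL/cl.h>

/* Query how many work-items make up one independently scheduled "thread"
 * for a compiled kernel on a given device (warp, wavefront, or vector width). */
size_t thread_width(cl_kernel kernel, cl_device_id device)
{
    size_t width = 1;
    cl_int err = clGetKernelWorkGroupInfo(
        kernel, device,
        CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(width), &width, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clGetKernelWorkGroupInfo failed: %d\n", err);
    return width;  /* e.g., 32 on NVIDIA GPUs, 32 or 64 on AMD GPUs, small on CPUs */
}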

Page 9:

Lawrence Livermore National Laboratory LLNL-PRES-666776 10

!  All operations defined in terms of atomics

!  On CPU:
•  Add: Atomic fetch-and-add (FAA)
•  Read: Normal load
•  Write: Normal store
•  CAS: Atomic compare-and-swap

!  In OpenCL:
•  Add: Atomic fetch-and-add (FAA)
•  Read: Atomic fetch-and-add of 0, or atomic_read, or a regular read after a flush if available
•  Write: Atomic exchange
•  CAS: Atomic compare-and-swap

Abstractions
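As an illustration (not the paper's exact code), the OpenCL side of this abstraction can be written as thin wrappers over the standard OpenCL 1.1 atomic built-ins for 32-bit global integers:

/* Illustrative OpenCL C mapping of the abstract operations above onto the
 * standard OpenCL 1.1 atomic built-ins for 32-bit global integers. */
#define Q_ADD(p, v)          atomic_add((p), (v))              /* FAA                     */
#define Q_READ(p)            atomic_add((p), 0)                /* read = FAA of 0         */
#define Q_WRITE(p, v)        atomic_xchg((p), (v))             /* write = atomic exchange */
#define Q_CAS(p, old, new)   atomic_cmpxchg((p), (old), (new)) /* compare-and-swap        */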

Page 10:

Lawrence Livermore National Laboratory LLNL-PRES-666776 11

Device              Num. devices   Cores/device   Threads/core   Max. threads   Max. achieved
AMD Opteron 6134    4              8              1              32             32
AMD Opteron 6272    2              16             1              32             32
Intel Xeon E5405    2              4              1              8              8
Intel Xeon X5680    1              12             2              24             24
Intel Core i5-3400  1              4              1              4              4

Experimental Setup: Hardware: CPUs

Page 11:

Lawrence Livermore National Laboratory LLNL-PRES-666776 12

Device                Num. devices    Cores/device   Threads/core   Max. threads   Max. achieved
AMD HD5870            1               20             24             496            140
AMD HD7970            1               32             40             1280           386
AMD HD7990            1 (of 2 dies)   32             40             1280           1020
Intel Xeon Phi P1750  1               61             4              244            244
NVIDIA GTX 280        1               30             32             960            960
NVIDIA Tesla C2070    1               14             32             448            448
NVIDIA Tesla K20c     1               13             64             832            832

Experimental Setup: Hardware: GPUs/Co-processors

Page 12:

Lawrence Livermore National Laboratory LLNL-PRES-666776 13

!  Debian Wheezy Linux 64-bit, kernel version 3.2
!  NVIDIA driver v. 313.3 with CUDA SDK 5.0
!  AMD fglrx driver v. 9.1.11 and APP SDK v. 2.8
!  Intel Xeon Phi driver MPSS gold 3
!  CPU and Phi OpenMP use Intel ICC v. 13.0.1

Experimental setup: Software

Page 13:

Lawrence Livermore National Laboratory LLNL-PRES-666776 14

Experimental setup: Detecting the real number of threads

… of the atomic-operation throughput of many-core architectures. Section 5 presents the design of our queue and its three interfaces, while Section 6 discusses linearizability [7]. Section 7 presents our experimental setup and benchmarks, while Section 8 presents the results of our experiments. Section 9 presents concluding remarks and future work.

2. BACKGROUND

In order to discuss the properties of our target architectures in a uniform manner, we first present our abstraction of the concurrency and memory model that we use across devices. This section discusses the abstraction that we employ in this paper in order to discuss OpenMP on CPUs and OpenCL on CPUs, GPUs and co-processors all interchangeably, along with our microbenchmark evaluation of atomic operations that make this possible across each architecture.

2.1 Threading Abstraction

While the threading models of OpenMP and OpenCL are significantly different, they can be reconciled. An OpenCL kernel runs a set of work-groups, each consisting of work-items or, as they are sometimes unfortunately misnamed, “threads.” We exclusively use the term “work-item” to refer to these throughout this paper. Work-items are usually a single lane of a vector computation, rather than an independent thread of control. In OpenMP, there is no observable equivalent to the work-item, though a single iteration of a loop parallelized by an omp simd directive would be closest.

OpenCL does have an equivalent to the OpenMP thread, but its interpretation changes from device to device. In NVIDIA GPUs, one thread is a “warp,” composed of 32 work-items. In AMD GPUs, a thread is a “wavefront” of 32 or 64 work-items. When run on CPUs, work-items may be either operating system threads or individual lanes of vector calculations as on GPUs. For common CPUs, this means each thread may be composed of one to eight work-items. The width of the thread-equivalent used in a compiled kernel in OpenCL can be reliably determined based on the OpenCL 1.1 kernel work-group info property “preferred work group size multiple,” which is what we use in our implementations. To establish consistent terminology, we use the term thread to refer to OpenMP threads on the CPU and Xeon Phi or independent groups of work-items in OpenCL. Work-items within a thread must execute in lockstep. It is unsafe for more than one work-item in a thread to interact with a concurrent data structure simultaneously. When a thread accesses any queue in this paper, only one work-item is active.

The additional wrinkle is that OpenCL has no mechanism to get the number of threads that actually run concurrently. While a user can request any number of threads, the number that run concurrently can be anywhere from one to the requested number. We add counters as depicted in Figure 1 to all benchmarks to count the number of threads that exist before the first thread finishes execution, which is a reliable upper bound on the number of concurrent working threads regardless of the behavior of the OpenCL runtime.

2.2 Memory Model

CPU models like OpenMP depend on cache-coherent shared memory for correctness. The OpenCL standard does not provide a sufficiently strong coherence model or a memory flush that can be used to implement one.

void test(unsigned *num_threads, unsigned *present)
{
    /* Check whether the kernel is already complete. */
    if (atomic_read(num_threads) != 0)
        return;
    /* Increment the number of threads present; the return value is the TID. */
    atomic_fetch_and_add(present, 1);
    run_benchmark();
    /* Set the kernel complete by publishing the number of threads seen. */
    atomic_compare_and_swap(num_threads, 0, atomic_read(present));
}

Figure 1: Design of concurrency detection in OpenCL benchmarks

The standard states that “there are no guarantees of memory consistency between different work-groups executing a kernel [1].” Writes in different work-groups are only guaranteed to be synchronized at the end of a kernel, and are thus available in subsequent kernels. The standard specifically allows writes to global memory to never become visible to other work-groups within a single kernel.

The exception is atomic operations, available since OpenCL 1.1, which are guaranteed to be visible and coherent across work-groups within a kernel as long as all work-groups are executing on the same device. Thus, every write and every read to global memory that is shared between work-groups must be atomic to ensure correctness in OpenCL. In practice, some OpenCL devices support a more coherent memory model than this, but it is not required and several architectures do not. For example, NVIDIA GPUs present a weak coherence model but offer a fence/flush through the PTX instruction membar.gl; this is not standard OpenCL, however, and must be used carefully. AMD GPUs have similar instructions at the ISA level, but inline assembly only accepts the intermediate CAL language, which has no equivalent.

For consistency, we express all algorithms as a set of abstract atomically coherent instructions. In OpenCL, all operations are implemented with explicit atomic intrinsics, including load and store, to maintain coherence. In the CPU implementation, atomic reads and writes are standard load and store instructions, while fetch-and-add (FAA) and compare-and-swap (CAS) use the sequentially consistent memory ordering. The algorithm does not intrinsically require that the ordering be that strong; using the relaxed model on increments with matching acquire and release on reads and exchanges would be sufficient. We use the stronger consistency model because it is the default in OpenCL 2.0 and the only model exposed in OpenCL 1.2, which we used to implement our non-CPU device tests.
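To make the last point concrete, here is a minimal C11 <stdatomic.h> sketch of the weaker ordering described above. It is illustrative only; the wrapper names are assumptions, and the paper's CPU implementation uses sequentially consistent FAA and CAS with plain loads and stores.

#include <stdatomic.h>

/* Illustrative C11 mapping of the abstract operations with relaxed
 * increments, acquire reads, and release exchanges, as discussed above. */
static inline unsigned q_add(_Atomic unsigned *p, unsigned v)
{ return atomic_fetch_add_explicit(p, v, memory_order_relaxed); }

static inline unsigned q_read(_Atomic unsigned *p)
{ return atomic_load_explicit(p, memory_order_acquire); }

static inline void q_write(_Atomic unsigned *p, unsigned v)
{ atomic_exchange_explicit(p, v, memory_order_release); }

static inline unsigned q_cas(_Atomic unsigned *p, unsigned expected, unsigned desired)
{
    atomic_compare_exchange_strong_explicit(p, &expected, desired,
                                            memory_order_acq_rel, memory_order_acquire);
    return expected; /* value observed before the CAS, whether it succeeded or not */
}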

3. RELATED WORK

Concurrent queues have been studied for decades, nearly as long as computers with multiple computational units have existed to run them. We will elide some of the early history and refer the reader to the surveys provided by the papers referenced below, especially the Michael and Scott [11] survey, which provides significant discussion of early designs.

Array queues. The array queue proposed by Gottlieb et al. [4] in 1983 is notable for scaling near-linearly to 100 cores in simulation at the time. The Gottlieb queue can scale to as many threads as the hardware can run concurrently due to the use of a pair of counters to select a location and fine-grained locking on each location in the queue. Unfortunately, however, the Gottlieb queue has been proven to be non-linearizable [2] due to the counters, which can cause the queue to appear empty or full spuriously. Orozco et al. [13] present two related array queues called the Circular Buffer Queue (CB-Queue) and the High-Throughput Queue (HT-Queue) …


Page 14:

Lawrence Livermore National Laboratory LLNL-PRES-666776 15

!  Definitions, abstractions and experimental setup
!  Building blocks: Evaluating atomic operations
!  Queue types and modeling
!  Our queue design
!  Performance evaluation
!  Conclusions

Outline

Page 15:

Lawrence Livermore National Laboratory LLNL-PRES-666776 16

Atomic performance test

kernel void cas_test(__global unsigned *in, __global unsigned *out, unsigned iterations)
{
    const unsigned tid = (get_local_id(1) * get_local_size(0)) + get_local_id(0);
    const unsigned gid = (get_group_id(1) * get_local_size(0)) + get_group_id(0);
    __local unsigned success;
    unsigned my_success = 0;

    if (tid == 0) {
        unsigned prev = 0;
        for (size_t i = 0; i < iterations; ++i) {
            /* Read the current value, then attempt to increment it with CAS;
               count how many attempts actually succeed. */
            prev = atomic_add(in, 0);
            my_success += (atomic_cmpxchg(in, prev, prev + 1) == prev) ? 1 : 0;
        }
        out[gid] = my_success;
    }
}
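For the CPU and Xeon Phi results, the same measurement idea applies. A rough C11/OpenMP analogue of this micro-benchmark, illustrative only and not the paper's actual harness, would be:

#include <stdatomic.h>
#include <stdio.h>

/* Rough CPU-side analogue of the cas_test kernel above: each thread
 * repeatedly reads a shared counter and tries to CAS it to the
 * incremented value, counting successful attempts. */
int main(void)
{
    _Atomic unsigned shared = 0;
    const unsigned iterations = 1000000;

    #pragma omp parallel
    {
        unsigned my_success = 0;
        for (unsigned i = 0; i < iterations; ++i) {
            unsigned prev = atomic_load(&shared);
            if (atomic_compare_exchange_strong(&shared, &prev, prev + 1))
                my_success++;
        }
        #pragma omp critical
        printf("successful CAS: %u of %u attempts\n", my_success, iterations);
    }
    return 0;
}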

Page 16:

Lawrence Livermore National Laboratory LLNL-PRES-666776 17

Atomic operation performance

17"Independent threads Thro

ughp

ut in

mill

ion

oper

atio

ns p

er s

econ

d

●● ●

●● ●● ●●● ●

● ● ● ● ● ● ● ● ●

●●●●●●●●●●●●

● ● ● ● ● ● ●

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

250

500

750

0

20

40

60

80

0

1

2

3

4

0

200

400

600

30

60

90

0

5

10

15

10

20

30

40

50

0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsTh

roug

hput

in m

illion

ope

ratio

ns p

er s

econ

d

Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG

●● ●

●● ●● ●●● ●

● ● ● ● ● ● ● ● ●

●●●●●●●●●●●●

● ● ● ● ● ● ●

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

250

500

750

0

20

40

60

80

0

1

2

3

4

0

200

400

600

30

60

90

0

5

10

15

10

20

30

40

50

0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT

hrou

ghpu

t in

mill

ion

oper

atio

ns p

er s

econ

d

Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG

●● ●

●● ●● ●●● ●

● ● ● ● ● ● ● ● ●

●●●●●●●●●●●●

● ● ● ● ● ● ●

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

250

500

750

0

20

40

60

80

0

1

2

3

4

0

200

400

600

30

60

90

0

5

10

15

10

20

30

40

50

0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT

hrou

ghpu

t in

mill

ion

oper

atio

ns p

er s

econ

d

Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG

●● ●

●● ●● ●●● ●

● ● ● ● ● ● ● ● ●

●●●●●●●●●●●●

● ● ● ● ● ● ●

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

250

500

750

0

20

40

60

80

0

1

2

3

4

0

200

400

600

30

60

90

0

5

10

15

10

20

30

40

50

0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT

hrou

ghpu

t in

mill

ion

oper

atio

ns p

er s

econ

d

Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG

●● ●

●● ●● ●●● ●

● ● ● ● ● ● ● ● ●

●●●●●●●●●●●●

● ● ● ● ● ● ●

● ● ● ● ●

● ● ● ● ● ● ● ● ● ●

● ●

● ●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

250

500

750

0

20

40

60

80

0

1

2

3

4

0

200

400

600

30

60

90

0

5

10

15

10

20

30

40

50

0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT

hrou

ghpu

t in

mill

ion

oper

atio

ns p

er s

econ

d

Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG

Successful CAS rate decreases with number of

threads!

Other atomic operations scale up with the thread

count

For more architectures, see the paper

CAS: 1,478/ms FAA: 859,524/ms

FAA is 581 times faster

Page 17:

Lawrence Livermore National Laboratory LLNL-PRES-666776 18

!  Definitions and abstractions
!  Building blocks: Evaluating atomic operations
!  Queue types and modeling
!  Our queue design
!  Performance evaluation
!  Conclusions

Outline

Page 18:

Lawrence Livermore National Laboratory LLNL-PRES-666776 19

!  All concurrent queues require either:
•  Locks, or
•  Atomic operations

!  Model result: Throughput (T) for a given number of threads (t)

!  Terms, average latency of the constituent atomics (subscript t: measured with t concurrent threads; subscript 1: single-thread latency):
•  Read: r
•  Write: w
•  Add (FAA): a
•  Successful contended CAS: c
•  Attempted CAS: C

General modeling of queues

Page 19:

Lawrence Livermore National Laboratory LLNL-PRES-666776 20

!  Contended CAS (MS queue and TZ queue):
   T_t = 2 / ((2 r_t + c_t) + (r_t + w_t + c_t))

!  Un-contended CAS (LCRQ):
   T_t = 1 / (a_t + r_t + C_t)

!  Combining (FC queue):
   T_t = 2 / ((r_1 + 2 w_1) + (2 r_1 + w_1))

!  FAA or blocking array (CB queue and our queue):
   T_t = 2 / ((a_t + r_t + w_t) + (a_t + 2 w_t))

Queue types
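To make the comparison concrete, a small helper could evaluate each prediction from measured latencies. This is a hypothetical sketch using the models as reconstructed above; names and units are assumptions.

/* Hypothetical helper: evaluate the four throughput models above from
 * measured average atomic latencies (seconds per operation). Latencies
 * with subscript t go in 'lt'; single-thread latencies go in 'l1'. */
typedef struct { double r, w, c, C, a; } lat_t;

double T_contended_cas(lat_t lt)   { return 2.0 / ((2*lt.r + lt.c) + (lt.r + lt.w + lt.c)); }
double T_uncontended_cas(lat_t lt) { return 1.0 / (lt.a + lt.r + lt.C); }
double T_combining(lat_t l1)       { return 2.0 / ((l1.r + 2*l1.w) + (2*l1.r + l1.w)); }
double T_faa(lat_t lt)             { return 2.0 / ((lt.a + lt.r + lt.w) + (lt.a + 2*lt.w)); }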

Page 20:

Lawrence Livermore National Laboratory LLNL-PRES-666776 25

Modeled queue throughput

25"Independent threads Th

roug

hput

in m

illio

n op

erat

ions

per

sec

ond

For more architectures, see the paper

● ● ● ● ● ● ● ● ● ●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ●

● ●●

●●

● ● ●● ●

● ● ●

●●

●●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

100

200

300

5

10

15

20

0

50

100

150

200

15

20

25

30

1

2

3

4

5

6

9

12

15

18

0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads

Thro

ughp

ut in

milli

on o

pera

tions

per

sec

ond

Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue

● ● ● ● ● ● ● ● ● ●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ●

● ●●

●●

● ● ●● ●

● ● ●

●●

●●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

100

200

300

5

10

15

20

0

50

100

150

200

15

20

25

30

1

2

3

4

5

6

9

12

15

18

0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads

Thro

ughp

ut in

milli

on o

pera

tions

per

sec

ond

Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue

2x Opteron 6272

● ● ● ● ● ● ● ● ● ●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ●

● ●●

●●

● ● ●● ●

● ● ●

●●

●●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

100

200

300

5

10

15

20

0

50

100

150

200

15

20

25

30

1

2

3

4

5

6

9

12

15

18

0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads

Thro

ughp

ut in

milli

on o

pera

tions

per

sec

ond

Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue

Intel Xeon Phi

● ● ● ● ● ● ● ● ● ●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ●

● ●●

●●

● ● ●● ●

● ● ●

●●

●●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

100

200

300

5

10

15

20

0

50

100

150

200

15

20

25

30

1

2

3

4

5

6

9

12

15

18

0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads

Thro

ughp

ut in

milli

on o

pera

tions

per

sec

ond

Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue

NVIDIA K20c

● ● ● ● ● ● ● ● ● ●

● ●

●●

● ● ● ● ● ● ● ● ● ● ● ●

● ●●

●●

● ● ●● ●

● ● ●

●●

●●

Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s

0

100

200

300

5

10

15

20

0

50

100

150

200

15

20

25

30

1

2

3

4

5

6

9

12

15

18

0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads

Thro

ughp

ut in

milli

on o

pera

tions

per

sec

ond

Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue

AMD HD7970

Combining queue performance is independent

of thread count

Contended-CAS queue performance degrades as

threads increase

Un-contended-CAS and FAA queues scale with

additional threads

Page 21:

Lawrence Livermore National Laboratory LLNL-PRES-666776 26

!  Definitions and abstractions
!  Building blocks: Evaluating atomic operations
!  Queue types and modeling
!  Our queue design
!  Performance evaluation
!  Conclusions

Outline

Page 22:

Lawrence Livermore National Laboratory LLNL-PRES-666776 27

Our queue design: Goals

!  Scale well on many-core architectures
•  Avoid contended CAS!

!  Maintain Linearizability and FIFO ordering

!  Allow the status of the queue to be inspected

Page 23:

Lawrence Livermore National Laboratory LLNL-PRES-666776 29

!  Blocking interface: The fast, concurrent interface
•  enqueue(q, data) -> success or closed
•  dequeue(q, &data) -> success or closed

!  Non-waiting interface:
•  enqueue_nw(q, data) -> success, not_ready or closed
•  dequeue_nw(q, &data) -> success, not_ready or closed

!  Status inspection interface
•  distance(q) -> the distance between head and tail, corrected for rollover
•  waiting_enqueuers(q) -> number of enqueuers blocking
•  waiting_dequeuers(q) -> number of dequeuers blocking
•  is_full(q) -> true if full, else false
•  is_empty(q) -> true if empty, else false

(A C-style sketch of these three interfaces follows below.)

Our queue design: Solution, divide the interfaces
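A minimal C-style header sketch of the three interfaces listed above. The operation names come from the slide; the types, return enum, and payload width are illustrative assumptions, not the paper's actual API.

/* Illustrative header sketch of the queue's three interfaces. */
#include <stdbool.h>
#include <stdint.h>

typedef struct queue queue_t;                          /* opaque queue handle */
typedef enum { Q_SUCCESS, Q_NOT_READY, Q_CLOSED } q_status;

/* Blocking interface: the fast, concurrent interface. */
q_status enqueue(queue_t *q, uint64_t data);
q_status dequeue(queue_t *q, uint64_t *data);

/* Non-waiting interface: returns Q_NOT_READY instead of blocking. */
q_status enqueue_nw(queue_t *q, uint64_t data);
q_status dequeue_nw(queue_t *q, uint64_t *data);

/* Status inspection interface. */
int64_t  distance(queue_t *q);           /* head-to-tail distance, rollover-corrected */
uint64_t waiting_enqueuers(queue_t *q);  /* number of enqueuers currently blocking    */
uint64_t waiting_dequeuers(queue_t *q);  /* number of dequeuers currently blocking    */
bool     is_full(queue_t *q);
bool     is_empty(queue_t *q);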

Page 24:

Lawrence Livermore National Laboratory LLNL-PRES-666776 31

Our queue’s blocking behavior: Enqueue example: Get targets with FAA

31"

0 3 Tail Head

Thread 1 Thread 2 Thread 3 4 0 5 0 6 0

0 0 0 0 0 0 0 0 0 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 25:

Lawrence Livermore National Laboratory LLNL-PRES-666776 32

Our queue’s blocking behavior: Enqueue example: Get targets with FAA

32"

0 3 Tail Head

Thread 1 Thread 2 Thread 3 4 0 5 0 6 0

0 0 0 0 0 0 0 0 0 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 26:

Lawrence Livermore National Laboratory LLNL-PRES-666776 33

Our queue’s blocking behavior: Enqueue example: Get targets with FAA

33"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 0 0 0 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 27:

Lawrence Livermore National Laboratory LLNL-PRES-666776 34

Our queue’s blocking behavior: Enqueue example: Write values

34"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 0 0 0 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 28:

Lawrence Livermore National Laboratory LLNL-PRES-666776 35

Our queue’s blocking behavior: Enqueue example: Write values

35"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 0 0 0 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 29:

Lawrence Livermore National Laboratory LLNL-PRES-666776 36

Our queue’s blocking behavior: Enqueue example: Write values

36"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 5 6 4 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 30:

Lawrence Livermore National Laboratory LLNL-PRES-666776 37

Our queue’s blocking behavior: Enqueue example: Update slots

37"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 5 6 4 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 31:

Lawrence Livermore National Laboratory LLNL-PRES-666776 38

Our queue’s blocking behavior: Enqueue example: Update slots

38"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 5 6 4 1 2 3

Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array

Page 32:

Lawrence Livermore National Laboratory LLNL-PRES-666776 39

Our queue’s blocking behavior: Enqueue example: Update slots

39"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 5 6 4 1 2 3

Slot array 0 0 0 0 0 0 1 1 1 1 1 1 Value array

Page 33:

Lawrence Livermore National Laboratory LLNL-PRES-666776 40

Our queue’s blocking behavior: Enqueue example: Update slots

40"

0 6 Tail Head

Thread 1 Thread 2 Thread 3 4 3 5 5 6 4

0 0 0 0 0 0 5 6 4 1 2 3

Slot array 0 0 0 0 0 0 1 1 1 1 1 1 Value array

Safe and parallel! For complete implementation details, see the paper.
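The walk-through above can be summarized in code. Below is a minimal sketch of an FAA-based blocking ring buffer in C11; the slot encoding (a per-position round counter), the names, and the fixed payload type are assumptions for illustration, not the paper's implementation, which also provides the non-waiting and status-inspection interfaces.

#include <stdatomic.h>
#include <stdint.h>

/* Minimal sketch of an FAA-based blocking ring queue in the spirit of the
 * enqueue example above. Slots hold a round counter: even = free for
 * round/2 enqueuers, odd = full for round/2 dequeuers. */
typedef struct {
    _Atomic uint64_t  head, tail;   /* dequeue / enqueue tickets          */
    _Atomic uint64_t *slot;         /* per-position round counter         */
    uint64_t         *value;        /* per-position payload               */
    uint64_t          capacity;
} faa_queue_t;

void enqueue_blocking(faa_queue_t *q, uint64_t v)
{
    uint64_t t     = atomic_fetch_add(&q->tail, 1);    /* 1. claim target with FAA */
    uint64_t pos   = t % q->capacity;
    uint64_t round = t / q->capacity;
    while (atomic_load(&q->slot[pos]) != 2 * round)
        ;                                               /* wait: position still in use */
    q->value[pos] = v;                                  /* 2. write the value          */
    atomic_store(&q->slot[pos], 2 * round + 1);         /* 3. mark the slot full       */
}

uint64_t dequeue_blocking(faa_queue_t *q)
{
    uint64_t t     = atomic_fetch_add(&q->head, 1);
    uint64_t pos   = t % q->capacity;
    uint64_t round = t / q->capacity;
    while (atomic_load(&q->slot[pos]) != 2 * round + 1)
        ;                                               /* wait for the matching enqueuer */
    uint64_t v = q->value[pos];
    atomic_store(&q->slot[pos], 2 * round + 2);         /* free the slot for the next round */
    return v;
}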

Page 34:

Lawrence Livermore National Laboratory LLNL-PRES-666776 44

!  Definitions and abstractions
!  Building blocks: Evaluating atomic operations
!  Queue types and modeling
!  Our queue design
!  Performance evaluation
!  Conclusions

Outline

Page 35:

Lawrence Livermore National Laboratory LLNL-PRES-666776 45

Evaluation: Queues Under Consideration

!  Michael & Scott (MS) queue: Contended CAS
•  Storage: Unbounded linked list
•  Progress guarantee: lock-free
•  Coherence mechanism: CAS on head and tail

!  Tsigas & Zhang (TZ) queue: Contended CAS
•  Storage: Bounded array
•  Progress guarantee: lock-free
•  Coherence mechanism: CAS on head and tail

!  Flat-combining (FC) queue: Combining
•  Storage: Unbounded linked list
•  Progress guarantee: lock-free, *blocking*
•  Coherence mechanism: Serialization, single worker thread at a time

!  Linked Concurrent Ring Queue (LCRQ): Un-contended CAS
•  Storage: Unbounded linked list of blocking array-based queues
•  Progress guarantee: lock-free
•  Coherence mechanism: Double-wide CAS (precludes implementation on AMD GPUs)

Page 36:

Lawrence Livermore National Laboratory LLNL-PRES-666776 46

Evaluation: Test loops

!  Matching enqueue/dequeue:
•  All threads:
—  Dequeue a value
—  Work on the value for 100 iterations
—  Enqueue the new value
—  Work out-of-band for 100 iterations

!  Producer/consumer:
•  25% of all threads:
—  Enqueue a value
—  Work for 100 iterations
•  The remaining 75%:
—  Dequeue a value
—  Work for 100 iterations

!  All tests run for 5 seconds and are self-stopped on the device

(A sketch of the matching loop follows below.)
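A rough sketch of the per-thread body of the matching enqueue/dequeue test, reusing the hypothetical interface sketch from earlier. The work() helper and the stop flag are stand-ins, not the paper's benchmark code.

#include <stdint.h>

/* Stand-in for the 100-iteration busy work on a value. */
static void work(uint64_t *v, int iterations)
{
    for (int i = 0; i < iterations; ++i)
        *v = (*v * 2654435761u) + 1;   /* arbitrary per-iteration update */
}

/* Per-thread body of the matching enqueue/dequeue loop. */
void matching_test(queue_t *q, volatile int *stop)
{
    uint64_t out_of_band = 0;
    while (!*stop) {                               /* run until the 5-second timer fires */
        uint64_t v;
        if (dequeue(q, &v) != Q_SUCCESS) break;    /* dequeue a value          */
        work(&v, 100);                             /* work on the value        */
        if (enqueue(q, v) != Q_SUCCESS) break;     /* enqueue the new value    */
        work(&out_of_band, 100);                   /* out-of-band work         */
    }
}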

Page 37:

Lawrence Livermore National Laboratory LLNL-PRES-666776 49

Evaluation: CPU performance

[Figure: operations in millions per second vs. independent threads, for the matching and producer/consumer tests, one panel per system: 2x AMD Opteron 6272, 2x Intel Xeon E5405, 4x AMD Opteron 6134, Intel Xeon X5680. Queues compared: Flat-Combining queue, LCRQ-32bit, Michael and Scott queue, New (Blocking Enq&Deq), New (Non-waiting Deq, Blocking Enq), New (Non-waiting Enq&Deq), Tsigas and Zhang queue.]

Page 38:

Lawrence Livermore National Laboratory LLNL-PRES-666776 50

Evaluation: CPU performance: Oversubscribing

[Figure: operations in millions per second vs. independent threads at oversubscribed thread counts, for the producer/consumer and matching enqueue/dequeue tests. Queues compared: Flat-Combining queue, LCRQ-32bit, Michael and Scott queue, New (Blocking Enq&Deq), New (Non-waiting Deq, Blocking Enq), New (Non-waiting Enq&Deq), Tsigas and Zhang queue.]

Page 39:

Lawrence Livermore National Laboratory LLNL-PRES-666776 53

Evaluation: Acc. performance: Current-Gen: Matching benchmark

[Figure: operations in millions per second (log2 scale) vs. independent threads, one panel per accelerator: AMD HD5870, AMD HD7970, AMD HD7990, Intel Xeon Phi, NVIDIA GeForce GTX 280, NVIDIA Tesla C2070, NVIDIA Tesla K20c. Queues compared: Flat-Combining queue, LCRQ-32bit, Michael and Scott queue, New (Blocking Enq&Deq), New (Non-waiting Deq, Blocking Enq), New (Non-waiting Enq&Deq), Tsigas and Zhang queue.]

!  LCRQ lags behind by only 17%
!  1,408 times speedup from the MS queue to the blocking queue

Page 40:

Lawrence Livermore National Laboratory LLNL-PRES-666776 54

Evaluation: Acc. performance: Current-Gen: Prod./Cons.

[Figure: operations in millions per second (log2 scale) vs. independent threads, one panel per accelerator: AMD HD5870, AMD HD7970, AMD HD7990, Intel Xeon Phi, NVIDIA GeForce GTX 280, NVIDIA Tesla C2070, NVIDIA Tesla K20c. Queues compared: Flat-Combining queue, LCRQ-32bit, Michael and Scott queue, New (Blocking Enq&Deq), New (Non-waiting Deq, Blocking Enq), New (Non-waiting Enq&Deq), Tsigas and Zhang queue.]

!  LCRQ drops to 2,700 ops/second
!  33,043.91 times more ops with blocking than with LCRQ in this case

Page 41:

Lawrence Livermore National Laboratory LLNL-PRES-666776 56

!  Designing concurrent data structures for throughput is important in modern architectures
!  CAS can be dangerous with enough threads
!  Our queue shows between a 1.5x and 1000x speedup over state of the practice for many-core architectures
!  Allowing blocking can be beneficial!

Conclusions