design and evaluation of scalable concurrent queues … · nvidia driver v. 313.3 with cuda ... but...
TRANSCRIPT
LLNL-PRES-666776 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures
ICPE 2015
Thomas R. W. Scogland, Wu-chun Feng
February 2nd, 2015
Lawrence Livermore National Laboratory LLNL-PRES-666776 3
Why another concurrent queue?
Lawrence Livermore National Laboratory LLNL-PRES-666776 4
Heterogeneity and many-core are a fact of life in modern computing
Lawrence Livermore National Laboratory LLNL-PRES-666776 5
Everything from cell phones
By Zach Vega (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Lawrence Livermore National Laboratory LLNL-PRES-666776 6
To supercomputers
Image Courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy
Lawrence Livermore National Laboratory LLNL-PRES-666776 7
Why not existing lock-free queues? ! Traditional lock-free queues focus on progress over throughput
! Perfect for over-subscribed systems, but they do not scale
7"
0
500
1000
1500
2000
2500
3 3 6 6 9 9 12 12 15 15 18 18 21 21 24 24 27 27 30 30
Ope
ratio
ns p
er m
illis
econ
d
Independent threads
Four Opteron 6134 CPUs
0
500
1000
1500
2000
2500
11 83 155 227 299 371 443 515 587 659 731 803
Ope
ratio
ns p
er m
illis
econ
d
Independent threads
One NVIDIA K20c GPU
Lawrence Livermore National Laboratory LLNL-PRES-666776 8
! Definitions and abstractions ! Building blocks: Evaluating atomic operations ! Queue types and modeling ! Our queue design ! Performance evaluation ! Conclusions
Outline
Lawrence Livermore National Laboratory LLNL-PRES-666776 9
! Work-item: The basic unit of work in OpenCL • Groups of work-items execute in lock-step • Work-items are not threads
! Thread: An independently schedulable entity • An OS thread on CPUs • In OpenCL, defined as a group of work-items of size
“PREFERRED_WORK_GROUP_SIZE_MULTIPLE”
Definitions: What is a “thread”?
Lawrence Livermore National Laboratory LLNL-PRES-666776 10
! All operations defined in terms of atomics ! On CPU:
• Add: Atomic Fetch-and-add (FAA) • Read: Normal load • Write: Normal store • CAS: Atomic Compare and Swap
! On OpenCL: • Add: Atomic Fetch-and-add (FAA) • Read: Atomic Fetch-and-add 0, or atomic_read, or regular
read after flush if available • Write: Atomic exchange • CAS: Atomic Compare and Swap
Abstractions
Lawrence Livermore National Laboratory LLNL-PRES-666776 11
Device Num. devices
Cores/device
Threads/core
Max. threads
Max. achieved
AMD Opteron 6134 4 8 1 32 32 AMD Opteron 6272 2 16 1 32 32 Intel Xeon E5405 2 4 1 8 8 Intel Xeon X5680 1 12 2 24 24 Intel Core i5-3400 1 4 1 4 4
Experimental Setup: Hardware: CPUs
Lawrence Livermore National Laboratory LLNL-PRES-666776 12
Device Num. devices
Cores/device
Threads/core
Max. threads
Max. achieved
AMD HD5870 1 20 24 496 140 AMD HD7970 1 32 40 1280 386 AMD HD7990 1 (of 2 dies) 32 40 1280 1020 Intel Xeon Phi P1750
1 61 4 244 244
NVIDIA GTX 280 1 30 32 960 960 NVIDIA Tesla C2070
1 14 32 448 448
NVIDIA Tesla K20c 1 13 64 832 832
Experimental Setup: Hardware: GPUs/Co-processors
Lawrence Livermore National Laboratory LLNL-PRES-666776 13
! Debian Wheezy Linux 64-bit kernel version 3.2 ! NVIDIA driver v. 313.3 with CUDA SDK 5.0 ! AMD fglrx driver v. 9.1.11 and APP SDK v. 2.8 ! Intel Xeon Phi driver MPSS gold 3 ! CPU and Phi OpenMP use Intel ICC v. 13.0.1
Experimental setup: Software
Lawrence Livermore National Laboratory LLNL-PRES-666776 14
Experimental setup: Detecting the real number of threads
of the atomic-operation throughput of many-core architec-tures. Section 5 presents the design of our queue and itsthree interfaces while Section 6 discusses linearizability [7].Section 7 presents our experimental setup and benchmarkswhile Section 8 presents the results of our experiments. Sec-tion 9 presents concluding remarks and future work.
2. BACKGROUNDIn order to discuss the properties of our target architec-
tures in a uniform manner, we first present our abstractionof the concurrency and memory model that we use acrossdevices. This section discusses the abstraction that we em-ploy in this paper in order to discuss OpenMP on CPUs andOpenCL on CPUs, GPUs and co-processors all interchange-ably, along with our microbenchmark evaluation of atomicoperations that make this possible across each architecture.
2.1 Threading AbstractionWhile the threading models of OpenMP and OpenCL are
significantly di↵erent, they can be reconciled. An OpenCLkernel runs a set of work-groups, each consisting of work-items or, as they are sometimes unfortunately misnamed“threads.” We exclusively use the term “work-item” to referto these throughout this paper. Work-items are usually asingle lane of a vector computation, rather than an indepen-dent thread of control. In OpenMP, there is no observableequivalent to the work-item, though a single iteration of aloop parallelized by an omp simd directive would be closest.
OpenCL does have an equivalent to the OpenMP threadhowever, but its interpretation changes from device to de-vice. In NVIDIA GPUs, one thread is a “warp,” composedof 32 work-items. In AMD GPUs, a thread is a “wavefront”of 32 or 64 work-items. When run on CPUs, work-itemsmay be either operating system threads or individual lanesof vector calculations as on GPUs. For common CPUs, thismeans each thread may be composed of one to eight work-items. The width of the thread-equivalent used in a com-piled kernel in OpenCL can be reliably determined based onthe OpenCL 1.1 kernel work group info property “preferredwork group size multiple,” which is what we use in our im-plementations. To establish consistent terminology, we usethe term thread to refer to OpenMP threads on the CPU andXeon Phi or independent groups of work-items in OpenCL.Work-items within a thread must execute in lockstep. It isunsafe for more than one work-item in a thread to interactwith a concurrent data structure simultaneously. When athread accesses any queue in this paper, only one work-itemis active.
The additional wrinkle is that OpenCL has no mechanismto get the number of threads that actually run concurrently.While a user can request any number of threads, the num-ber that run concurrently can be anywhere from one to therequested number. We add counters as depicted in Figure 1to all benchmarks to count the number of threads that existbefore the first thread finishes execution, which is a reliableupper bound on the number of concurrent working threadsregardless of the behavior of the OpenCL runtime.
2.2 Memory ModelCPUmodels like OpenMP depend on cache-coherent shared
memory for correctness. The OpenCL standard does notprovide a su�ciently strong coherence model or a mem-ory flush that can be used to implement one. The stan-
void test(unsigned *num_threads, unsigned *present){if(atomic_read(num_threads) != 0)
return;atomic_fetch_and_add(present,1);run_benchmark();atomic_compare_and_swap(num_threads, 0, atomic_read(present));
}
Figure 1: Design of concurrency detection in OpenCL bench-marks
dard states that “there are no guarantees of memory consis-tency between di↵erent work-groups executing a kernel [1].”Writes in di↵erent work-groups are only guaranteed to besynchronized at the end of a kernel, and are thus available insubsequent kernels. The standard specifically allows writesto global memory to never become visible to other work-groups within a single kernel.The exception is atomic operations, available since OpenCL
1.1, which are guaranteed to be visible and coherent acrosswork-groups within a kernel as long as all work-groups areexecuting on the same device. Thus, every write and everyread to global memory that is shared between work-groupsmust be atomic to ensure correctness in OpenCL. In prac-tice, some OpenCL devices support a more coherent memorymodel than this, but it is not required and several architec-tures do not. For example, NVIDIA GPUs present a weakcoherence model, but o↵er a fence/flush through the PTXinstruction membar.gl, but this is not standard OpenCL andmust be used carefully. AMD GPUs have similar instruc-tions at the ISA level but inline assembly only accepts theintermediate CAL language, which has no equivalent.For consistency, we express all algorithms as a set of ab-
stract atomically coherent instructions. In OpenCL, all op-erations are implemented with explicit atomic intrinsics, in-cluding load and store, to maintain coherence. In the CPUimplementation, atomic reads and writes are standard loadand store instructions, while fetch-and-add (FAA) and compare-and-swap (CAS) use the sequentially consistent memory or-dering. The algorithm does not intrinsically require thatthe ordering be that strong, using the relaxed model on in-crements with matching acquire and release on reads andexchanges would be su�cient. We use the stronger consis-tency model because it is the default in OpenCL 2.0 andthe only model exposed in OpenCL 1.2, which we used toimplement our non-CPU device tests.
3. RELATED WORKConcurrent queues have been studied for decades, nearly
as long as computers with multiple computational units haveexisted to run them. We will elide some of the early historyand refer the reader to the surveys provided by the papersreferenced below, especially the Michael and Scott [11] sur-vey, which provides significant discussion of early designs.Array queues. The array queue proposed by Gottlieb et
al. [4] in 1983 is notable for scaling near-linearly to 100 coresin simulation at the time. The Gottlieb queue can scale toas many threads as the hardware can run concurrently dueto the use of a a pair of counters to select a location andfine-grained locking on each location in the queue. Unfortu-nately, however, the Gottlieb queue has been proven to benon-linearizable [2] due to the counters, which can cause thequeue to appear empty or full spuriously. Orozco et al. [13]present two related array queues called the Circular Bu↵erQueue (CB-Queue) and the High-Throughput Queue (HT-
Check if kernel is complete
Increment number of threads, returns TID
Set kernel complete
Lawrence Livermore National Laboratory LLNL-PRES-666776 15
! Definitions, abstractions and experimental setup ! Building blocks: Evaluating atomic operations ! Queue types and modeling ! Our queue design ! Performance evaluation ! Conclusions
Outline
Lawrence Livermore National Laboratory LLNL-PRES-666776 16
Atomic performance test
kernel void cas_test(__global unsigned * in, __global unsigned * out, unsigned iterations){! const unsigned tid = (get_local_id(1)*get_local_size(0)) + get_local_id(0);! const unsigned gid = (get_group_id(1)*get_local_size(0)) + get_group_id(0);! __local unsigned success;! unsigned my_success = 0;!! if(tid == 0){! unsigned prev = 0; ! for(size_t i=0; i < iterations; ++i){! prev = atomic_add(in,0);! my_success += atomic_cmpxchg(in,prev,prev+1) == prev ? 1 : 0; ! } ! out[gid] = my_success;! } !}
Lawrence Livermore National Laboratory LLNL-PRES-666776 17
Atomic operation performance
17"Independent threads Thro
ughp
ut in
mill
ion
oper
atio
ns p
er s
econ
d
●
●
●● ●
●● ●● ●●● ●
● ● ● ● ● ● ● ● ●
●●●●●●●●●●●●
●
● ● ● ● ● ● ●
●
●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
●
●
● ●
● ●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
250
500
750
0
20
40
60
80
0
1
2
3
4
0
200
400
600
30
60
90
0
5
10
15
10
20
30
40
50
0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsTh
roug
hput
in m
illion
ope
ratio
ns p
er s
econ
d
Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG
●
●
●● ●
●● ●● ●●● ●
● ● ● ● ● ● ● ● ●
●●●●●●●●●●●●
●
● ● ● ● ● ● ●
●
●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
●
●
● ●
● ●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
250
500
750
0
20
40
60
80
0
1
2
3
4
0
200
400
600
30
60
90
0
5
10
15
10
20
30
40
50
0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT
hrou
ghpu
t in
mill
ion
oper
atio
ns p
er s
econ
d
Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG
●
●
●● ●
●● ●● ●●● ●
● ● ● ● ● ● ● ● ●
●●●●●●●●●●●●
●
● ● ● ● ● ● ●
●
●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
●
●
● ●
● ●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
250
500
750
0
20
40
60
80
0
1
2
3
4
0
200
400
600
30
60
90
0
5
10
15
10
20
30
40
50
0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT
hrou
ghpu
t in
mill
ion
oper
atio
ns p
er s
econ
d
Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG
●
●
●● ●
●● ●● ●●● ●
● ● ● ● ● ● ● ● ●
●●●●●●●●●●●●
●
● ● ● ● ● ● ●
●
●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
●
●
● ●
● ●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
250
500
750
0
20
40
60
80
0
1
2
3
4
0
200
400
600
30
60
90
0
5
10
15
10
20
30
40
50
0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT
hrou
ghpu
t in
mill
ion
oper
atio
ns p
er s
econ
d
Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG
●
●
●● ●
●● ●● ●●● ●
● ● ● ● ● ● ● ● ●
●●●●●●●●●●●●
●
● ● ● ● ● ● ●
●
●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
●
●
● ●
● ●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA GTX 280 Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
250
500
750
0
20
40
60
80
0
1
2
3
4
0
200
400
600
30
60
90
0
5
10
15
10
20
30
40
50
0 200 400 600 50 100 150 200 0 100 200 300 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threadsT
hrou
ghpu
t in
mill
ion
oper
atio
ns p
er s
econ
d
Operation ●Attempted CAS FAA READ Successful CAS WRITE XCHG
Successful CAS rate decreases with number of
threads!
Other atomic operations scale up with the thread
count
For more architectures, see the paper
CAS: 1,478/ms FAA: 859,524/ms
FAA is 581 times faster
Lawrence Livermore National Laboratory LLNL-PRES-666776 18
! Definitions and abstractions ! Building blocks: Evaluating atomic operations ! Queue types and modeling ! Our queue design ! Performance evaluation ! Conclusions
Outline
Lawrence Livermore National Laboratory LLNL-PRES-666776 19
! All concurrent queues require either: • Locks, or • Atomic operations
! Model result: Throughput (T) for a given number of threads (t)
! Terms, average latency of constituent atomics: • Read: r • Write: w • Successful contended CAS: c • Attempted CAS: C
General modeling of queues
Lawrence Livermore National Laboratory LLNL-PRES-666776 20
! Contended CAS • MS queue and TZ queue
! Un-contended CAS • LCRQ
! Combining • FC queue
! FAA or blocking array • CB queue and our queue
Queue types
Tt = 2rt ×2+ ct( )+ (rt +wt + ct)
Tt = 1at + rt +Ct( )
Tt = 2r1+w1×2( )+ (r1×2+w1)
Tt = 2at + rt +wt( )+ (at +wt ×2)
Lawrence Livermore National Laboratory LLNL-PRES-666776 25
Modeled queue throughput
25"Independent threads Th
roug
hput
in m
illio
n op
erat
ions
per
sec
ond
For more architectures, see the paper
● ● ● ● ● ● ● ● ● ●
● ●
●
●●
● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ●●
●
●●
● ● ●● ●
● ● ●
●●
●
●
●●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
100
200
300
5
10
15
20
0
50
100
150
200
15
20
25
30
1
2
3
4
5
6
9
12
15
18
0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads
Thro
ughp
ut in
milli
on o
pera
tions
per
sec
ond
Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue
● ● ● ● ● ● ● ● ● ●
● ●
●
●●
● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ●●
●
●●
● ● ●● ●
● ● ●
●●
●
●
●●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
100
200
300
5
10
15
20
0
50
100
150
200
15
20
25
30
1
2
3
4
5
6
9
12
15
18
0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads
Thro
ughp
ut in
milli
on o
pera
tions
per
sec
ond
Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue
2x Opteron 6272
● ● ● ● ● ● ● ● ● ●
● ●
●
●●
● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ●●
●
●●
● ● ●● ●
● ● ●
●●
●
●
●●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
100
200
300
5
10
15
20
0
50
100
150
200
15
20
25
30
1
2
3
4
5
6
9
12
15
18
0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads
Thro
ughp
ut in
milli
on o
pera
tions
per
sec
ond
Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue
Intel Xeon Phi
● ● ● ● ● ● ● ● ● ●
● ●
●
●●
● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ●●
●
●●
● ● ●● ●
● ● ●
●●
●
●
●●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
100
200
300
5
10
15
20
0
50
100
150
200
15
20
25
30
1
2
3
4
5
6
9
12
15
18
0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads
Thro
ughp
ut in
milli
on o
pera
tions
per
sec
ond
Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue
NVIDIA K20c
● ● ● ● ● ● ● ● ● ●
● ●
●
●●
● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
● ●●
●
●●
● ● ●● ●
● ● ●
●●
●
●
●●
●
Acc., AMD HD7970 Acc., Intel Xeon Phi Acc., NVIDIA Tesla K20c CPU, 1 − Intel Xeon X5680 CPU, 2 − AMD Opteron 6272s CPU, 2 − Intel Xeon E5405s
0
100
200
300
5
10
15
20
0
50
100
150
200
15
20
25
30
1
2
3
4
5
6
9
12
15
18
0 200 400 600 50 100 150 200 0 200 400 600 5 10 15 20 25 10 20 30 2 4 6 8Concurrent threads
Thro
ughp
ut in
milli
on o
pera
tions
per
sec
ond
Operation ●Combining queue Contended CAS queue FAA queue Un−Contended CAS queue
AMD HD7970
Combining queue performance is independent
of thread count
Contended-CAS queue performance degrades as
threads increase
Un-contended-CAS and FAA queues scale with
additional threads
Lawrence Livermore National Laboratory LLNL-PRES-666776 26
! Definitions and abstractions ! Building blocks: Evaluating atomic operations ! Queue types and modeling ! Our queue design ! Performance evaluation ! Conclusions
Outline
Lawrence Livermore National Laboratory LLNL-PRES-666776 27
Our queue design: Goals ! Scale well on many-core architectures
• Avoid contended CAS!
! Maintain Linearizability and FIFO ordering
! Allow the status of the queue to be inspected
Lawrence Livermore National Laboratory LLNL-PRES-666776 29
! Blocking interface: The fast, concurrent interface • enqueue(q, data) -> success or closed • dequeue(q, &data) -> success or closed
! Non-waiting interface: • enqueue_nw(q, data) -> success, not_ready or closed • dequeue_nw(q, &data) -> success, not_ready or closed
! Status inspection interface • distance(q) -> the distance between head and tail, corrected for
rollover • waiting_enqueuers(q) -> number of enqueuers blocking • waiting_dequeuers(q) -> number of dequeuers blocking • is_full(q) -> true if full, else false • is_empty(q) -> true if empty, else false
Our queue design: Solution, divide the interfaces
Lawrence Livermore National Laboratory LLNL-PRES-666776 31
Our queue’s blocking behavior: Enqueue example: Get targets with FAA
31"
0 3 Tail Head
Thread 1 Thread 2 Thread 3 4 0 5 0 6 0
0 0 0 0 0 0 0 0 0 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 32
Our queue’s blocking behavior: Enqueue example: Get targets with FAA
32"
0 3 Tail Head
Thread 1 Thread 2 Thread 3 4 0 5 0 6 0
0 0 0 0 0 0 0 0 0 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 33
Our queue’s blocking behavior: Enqueue example: Get targets with FAA
33"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 0 0 0 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 34
Our queue’s blocking behavior: Enqueue example: Write values
34"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 0 0 0 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 35
Our queue’s blocking behavior: Enqueue example: Write values
35"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 0 0 0 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 36
Our queue’s blocking behavior: Enqueue example: Write values
36"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 5 6 4 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 37
Our queue’s blocking behavior: Enqueue example: Update slots
37"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 5 6 4 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 38
Our queue’s blocking behavior: Enqueue example: Update slots
38"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 5 6 4 1 2 3
Slot array 0 0 0 0 0 0 0 0 0 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 39
Our queue’s blocking behavior: Enqueue example: Update slots
39"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 5 6 4 1 2 3
Slot array 0 0 0 0 0 0 1 1 1 1 1 1 Value array
Lawrence Livermore National Laboratory LLNL-PRES-666776 40
Our queue’s blocking behavior: Enqueue example: Update slots
40"
0 6 Tail Head
Thread 1 Thread 2 Thread 3 4 3 5 5 6 4
0 0 0 0 0 0 5 6 4 1 2 3
Slot array 0 0 0 0 0 0 1 1 1 1 1 1 Value array
Safe and parallel! For complete
implementation details, see the paper
Lawrence Livermore National Laboratory LLNL-PRES-666776 44
! Definitions and abstractions ! Building blocks: Evaluating atomic operations ! Queue types and modeling ! Our queue design ! Performance evaluation ! Conclusions
Outline
Lawrence Livermore National Laboratory LLNL-PRES-666776 45
Evaluation: Queues Under Consideration ! Michael & Scott (MS) queue: Contended CAS
• Storage: Unbounded linked list • Progress guarantee: lock-free • Coherence mechanism: CAS on head and tail
! Tsigas & Zhang (TZ) queue: Contended CAS • Storage: Bounded array • Progress guarantee: lock-free • Coherence mechanism: CAS on head and tail
! Flat-combining (FC) queue: Combining • Storage: Unbounded linked list • Progress guarantee: lock-free, *blocking* • Coherence mechanism: Serialization, single worker thread at a time
! Linked Concurrent Ring Queue (LCRQ): Un-contended CAS • Storage: Unbounded linked-list of blocking array-based queues • Progress guarantee: lock-free • Coherence mechanism: Double-wide CAS (precludes implementation on
AMD GPUs)
Lawrence Livermore National Laboratory LLNL-PRES-666776 46
Evaluation: Test loops ! Matching enqueue/dequeue:
• All threads: — Dequeue a value — Work on the value for 100 iterations — Enqueue the new value — Work out-of-band for 100 iterations
! Producer/consumer: • 25% of all threads: — Enqueue a value — Work for 100 iterations
• The remaining 75%: — Dequeue a value — Work for 100 iterations
! All tests run for 5 seconds and are self-stopped on the device
Lawrence Livermore National Laboratory LLNL-PRES-666776 49
●●
●●
● ●● ● ● ●
●
●
●●
● ●
●
●●
● ● ● ● ● ● ●
●●
● ● ●●
● ●
●● ●
2 − AMD Opteron 6272s 2 − Intel Xeon E5405s 4 − AMD Opteron 6134s Intel Xeon X5680
1
2
3
4
5
3
4
5
6
1
2
3
5
10
15
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
Evaluation: CPU performance
Ope
ratio
ns in
mill
ions
per
sec
ond
Independent threads
●●
●●
● ●● ● ● ●
●
●
●●
● ●
●
●●
● ● ● ● ● ● ●
●●
● ● ●●
● ●
●● ●
Matching, 2 − AMD Opteron 6272s Matching, 2 − Intel Xeon E5405s Matching, 4 − AMD Opteron 6134s Matching, Intel Xeon X5680
1
2
3
4
5
3
4
5
6
1
2
3
5
10
15
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
●
●
● ●
●● ●
●● ●
●
●
●
●
●
● ●
●
● ●
● ● ●●
●
●
●
●
●● ●
●●
●
● ●●
2 − AMD Opteron 6272s 2 − Intel Xeon E5405s 4 − AMD Opteron 6134s Intel Xeon X5680
2
4
6
1
2
3
4
5
1
2
3
4
2.5
5.0
7.5
10.0
12.5
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
●●
●●
● ●● ● ● ●
●
●
●●
● ●
●
●●
● ● ● ● ● ● ●
●●
● ● ●●
● ●
●● ●
Matching, 2 − AMD Opteron 6272s Matching, 2 − Intel Xeon E5405s Matching, 4 − AMD Opteron 6134s Matching, Intel Xeon X5680
1
2
3
4
5
3
4
5
6
1
2
3
5
10
15
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
●
●
● ●
●● ●
●● ●
●
●
●
●
●
● ●
●
● ●
● ● ●●
●
●
●
●
●● ●
●●
●
● ●●
2 − AMD Opteron 6272s 2 − Intel Xeon E5405s 4 − AMD Opteron 6134s Intel Xeon X5680
2
4
6
1
2
3
4
5
1
2
3
4
2.5
5.0
7.5
10.0
12.5
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
Matching tests Producer/Consumer tests
Lawrence Livermore National Laboratory LLNL-PRES-666776 50
Evaluation: CPU performance: Oversubscribing
Ope
ratio
ns in
mill
ions
per
sec
ond
Independent threads
●●
●●
● ●● ● ● ●
●
●
●●
● ●
●
●●
● ● ● ● ● ● ●
●●
● ● ●●
● ●
●● ●
Matching, 2 − AMD Opteron 6272s Matching, 2 − Intel Xeon E5405s Matching, 4 − AMD Opteron 6134s Matching, Intel Xeon X5680
1
2
3
4
5
3
4
5
6
1
2
3
5
10
15
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
●
●● ● ● ● ● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
Producer/Consumer Matching Enq/Deq
0
5
10
15
0
5
10
15
0 50 100 0 50 100Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue ●
Flat−Combining queueLCRQ−32bitMichael and Scott queue
New − Blocking Enq&DeqNew − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
●
●● ● ● ● ● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
Producer/Consumer Matching Enq/Deq
0
5
10
15
0
5
10
15
0 50 100 0 50 100Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue ●
Flat−Combining queueLCRQ−32bitMichael and Scott queue
New − Blocking Enq&DeqNew − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
Lawrence Livermore National Laboratory LLNL-PRES-666776 53
● ● ● ●●
● ●● ● ● ● ● ●
● ● ● ● ● ● ●
●
●
● ● ● ● ● ●
●
●
●●
●● ● ● ●
AMD HD5870 AMD HD7970 AMD HD7990 Intel Xeon Phi
NVIDIA GeForce GTX 280 NVIDIA Tesla C2070 NVIDIA Tesla K20c
2−6
2−4
2−2
1
4
16
2−4
2−2
1
4
16
64
256
2−4
2−2
1
4
16
64
256
2−3
2−2
2−1
1
2
4
8
2−8
2−6
2−4
2−2
1
4
2−2
1
4
16
2−2
1
4
16
64
256
0 25 50 75 100 125 0 100 200 300 0 250 500 750 1000 50 100 150 200
0 250 500 750 0 100 200 300 400 0 200 400 600 800Independent threads
Ope
ratio
ns in
mill
ions
per
sec
ond
(Log
2)
Queue ●Flat−Combining queue LCRQ−32bit Michael and Scott queue New − Blocking Enq&Deq New − Non−waiting Deq, Blocking Enq New − Non−waiting Enq&Deq Tsigas and Zhang queue
Evaluation: Acc. performance: Current-Gen: Matching benchmark
Ope
ratio
ns in
mill
ions
per
sec
ond
(log2
)
Independent threads
●●
●●
● ●● ● ● ●
●
●
●●
● ●
●
●●
● ● ● ● ● ● ●
●●
● ● ●●
● ●
●● ●
Matching, 2 − AMD Opteron 6272s Matching, 2 − Intel Xeon E5405s Matching, 4 − AMD Opteron 6134s Matching, Intel Xeon X5680
1
2
3
4
5
3
4
5
6
1
2
3
5
10
15
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
● ● ● ●●
● ●● ● ● ● ● ●
● ● ● ● ● ● ●
●
●
● ● ● ● ● ●
●
●
●●
●● ● ● ●
AMD HD5870 AMD HD7970 AMD HD7990 Intel Xeon Phi
NVIDIA GeForce GTX 280 NVIDIA Tesla C2070 NVIDIA Tesla K20c
2−6
2−4
2−2
1
4
16
2−4
2−2
1
4
16
64
256
2−4
2−2
1
4
16
64
256
2−3
2−2
2−1
1
2
4
8
2−8
2−6
2−4
2−2
1
4
2−2
1
4
16
2−2
1
4
16
64
256
0 25 50 75 100 125 0 100 200 300 0 250 500 750 1000 50 100 150 200
0 250 500 750 0 100 200 300 400 0 200 400 600 800Independent threads
Ope
ratio
ns in
mill
ions
per
sec
ond
(Log
2)
Queue ●Flat−Combining queue LCRQ−32bit Michael and Scott queue New − Blocking Enq&Deq New − Non−waiting Deq, Blocking Enq New − Non−waiting Enq&Deq Tsigas and Zhang queue
● ● ● ●●
● ●● ● ● ● ● ●
● ● ● ● ● ● ●
●
●
● ● ● ● ● ●
●
●
●●
●● ● ● ●
AMD HD5870 AMD HD7970 AMD HD7990 Intel Xeon Phi
NVIDIA GeForce GTX 280 NVIDIA Tesla C2070 NVIDIA Tesla K20c
2−6
2−4
2−2
1
4
16
2−4
2−2
1
4
16
64
256
2−4
2−2
1
4
16
64
256
2−3
2−2
2−1
1
2
4
8
2−8
2−6
2−4
2−2
1
4
2−2
1
4
16
2−2
1
4
16
64
256
0 25 50 75 100 125 0 100 200 300 0 250 500 750 1000 50 100 150 200
0 250 500 750 0 100 200 300 400 0 200 400 600 800Independent threads
Ope
ratio
ns in
mill
ions
per
sec
ond
(Log
2)
Queue ●Flat−Combining queue LCRQ−32bit Michael and Scott queue New − Blocking Enq&Deq New − Non−waiting Deq, Blocking Enq New − Non−waiting Enq&Deq Tsigas and Zhang queue
LCRQ lags behind by only 17%
1,408 times speedup from MS-queue to the blocking
queue
Lawrence Livermore National Laboratory LLNL-PRES-666776 54
●●
●
●
●
●● ● ●
●
●
●
●
●
● ●● ●
●●
● ●
●●
●●
● ●
● ●
●
●●
● ● ● ● ● ●
AMD HD5870 AMD HD7970 AMD HD7990 Intel Xeon Phi
NVIDIA GeForce GTX 280 NVIDIA Tesla C2070 NVIDIA Tesla K20c
2−6
2−4
2−2
1
4
2−4
2−2
1
4
16
64
2−4
2−2
1
4
16
64
256
2−3
2−2
2−1
1
2
4
2−8
2−6
2−4
2−2
1
2−8
2−6
2−4
2−2
1
4
16
2−8
2−6
2−4
2−2
1
4
16
64
0 25 50 75 100 125 0 100 200 300 0 250 500 750 1000 50 100 150 200
0 250 500 750 0 100 200 300 400 0 200 400 600 800Independent threads
Ope
ratio
ns in
mill
ions
per
sec
ond
(Log
2)
Queue ●Flat−Combining queue LCRQ−32bit Michael and Scott queue New − Blocking Enq&Deq New − Non−waiting Deq, Blocking Enq New − Non−waiting Enq&Deq Tsigas and Zhang queue
●●
●
●
●
●● ● ●
●
●
●
●
●
● ●● ●
●●
● ●
●●
●●
● ●
● ●
●
●●
● ● ● ● ● ●
AMD HD5870 AMD HD7970 AMD HD7990 Intel Xeon Phi
NVIDIA GeForce GTX 280 NVIDIA Tesla C2070 NVIDIA Tesla K20c
2−6
2−4
2−2
1
4
2−4
2−2
1
4
16
64
2−4
2−2
1
4
16
64
256
2−3
2−2
2−1
1
2
4
2−8
2−6
2−4
2−2
1
2−8
2−6
2−4
2−2
1
4
16
2−8
2−6
2−4
2−2
1
4
16
64
0 25 50 75 100 125 0 100 200 300 0 250 500 750 1000 50 100 150 200
0 250 500 750 0 100 200 300 400 0 200 400 600 800Independent threads
Ope
ratio
ns in
mill
ions
per
sec
ond
(Log
2)
Queue ●Flat−Combining queue LCRQ−32bit Michael and Scott queue New − Blocking Enq&Deq New − Non−waiting Deq, Blocking Enq New − Non−waiting Enq&Deq Tsigas and Zhang queue
●●
●
●
●
●● ● ●
●
●
●
●
●
● ●● ●
●●
● ●
●●
●●
● ●
● ●
●
●●
● ● ● ● ● ●
AMD HD5870 AMD HD7970 AMD HD7990 Intel Xeon Phi
NVIDIA GeForce GTX 280 NVIDIA Tesla C2070 NVIDIA Tesla K20c
2−6
2−4
2−2
1
4
2−4
2−2
1
4
16
64
2−4
2−2
1
4
16
64
256
2−3
2−2
2−1
1
2
4
2−8
2−6
2−4
2−2
1
2−8
2−6
2−4
2−2
1
4
16
2−8
2−6
2−4
2−2
1
4
16
64
0 25 50 75 100 125 0 100 200 300 0 250 500 750 1000 50 100 150 200
0 250 500 750 0 100 200 300 400 0 200 400 600 800Independent threads
Ope
ratio
ns in
mill
ions
per
sec
ond
(Log
2)
Queue ●Flat−Combining queue LCRQ−32bit Michael and Scott queue New − Blocking Enq&Deq New − Non−waiting Deq, Blocking Enq New − Non−waiting Enq&Deq Tsigas and Zhang queue
Evaluation: Acc. performance: Current-Gen: Prod./Cons.
Ope
ratio
ns in
mill
ions
per
sec
ond
(log2
)
Independent threads
●●
●●
● ●● ● ● ●
●
●
●●
● ●
●
●●
● ● ● ● ● ● ●
●●
● ● ●●
● ●
●● ●
Matching, 2 − AMD Opteron 6272s Matching, 2 − Intel Xeon E5405s Matching, 4 − AMD Opteron 6134s Matching, Intel Xeon X5680
1
2
3
4
5
3
4
5
6
1
2
3
5
10
15
10 20 30 3 4 5 6 7 8 10 20 30 5 10 15 20 25Independent threads
Ope
ratio
ns in
milli
ons
per s
econ
d
Queue●
Flat−Combining queueLCRQ−32bit
Michael and Scott queueNew − Blocking Enq&Deq
New − Non−waiting Deq, Blocking EnqNew − Non−waiting Enq&Deq
Tsigas and Zhang queue
LCRQ drops to 2,700 ops/second
33,043.91 times more ops with blocking than LCRQ in
this case
Lawrence Livermore National Laboratory LLNL-PRES-666776 56
! Designing concurrent data-structures for throughput is important in modern architectures
! CAS can be dangerous with enough threads ! Our queue shows between a 1.5x and 1000x
speedup over state of the practice for many-core architectures
! Allowing blocking can be beneficial!
Conclusions