Transcript

Concurrent Data Structures in Architectures with Limited Shared Memory Support

Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden


Concurrent Data Structures
• Parallel/concurrent programming:
– Share data among threads/processes sharing a uniform address space (shared memory)
• Inter-process/thread communication and synchronization
– Both a tool and a goal


Concurrent Data Structures: Implementations
• Coarse-grained locking
– Easy, but slow...
• Fine-grained locking
– Fast/scalable, but error-prone (deadlocks)
• Non-blocking (see the sketch after this list)
– Atomic hardware primitives (e.g. TAS, CAS)
– Good progress guarantees (lock-/wait-freedom)
– Scalable
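For illustration, a minimal, hardware-agnostic sketch of the two kinds of primitives in portable C11 atomics. It only shows the difference between a TAS-based lock and a CAS-based non-blocking update; it is not SCC-specific code (the SCC itself offers no CAS, as discussed later).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* TAS-based spinlock: the basis of lock-based designs. */
typedef struct { atomic_flag flag; } tas_lock_t;

static void lock(tas_lock_t *l)   { while (atomic_flag_test_and_set(&l->flag)) ; }
static void unlock(tas_lock_t *l) { atomic_flag_clear(&l->flag); }

/* CAS-based lock-free increment: the basis of non-blocking designs. */
static void lockfree_inc(atomic_int *counter) {
    int old = atomic_load(counter);
    /* Retry until the CAS succeeds; no thread ever blocks.
       'old' is refreshed by each failed compare-exchange. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;
}
```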


What’s happening in hardware?
• Multi-cores → many-cores
– “Cache coherency wall” [Kumar et al 2011]
– Shared address space will not scale
– Universal atomic primitives (CAS, LL/SC) harder to implement
• Shared memory → message passing
[Figure: a tile with IA cores, caches, and shared local memory]


• Networks on chip (NoC)
• Short distance between cores
• Message passing model support
• Shared memory support
• Eliminated cache coherency
• Limited support for synchronization primitives
[Figure: a tile with IA cores, caches, and shared local memory]

Can we have data structures that are fast, scalable, and with good progress guarantees?


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• One TestAndSet register per core (see the lock sketch below)
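A minimal sketch of how that per-core register can back a lock, using the lock calls of the RCCE library that ships with the SCC. Treating core 0 as the lock's home is only an example, and the exact calling conventions should be checked against the RCCE headers.

```c
#include "RCCE.h"

/* Sketch: protect a critical section with the TAS register of core 0.
   RCCE_acquire_lock(id) spins on core id's test-and-set register;
   RCCE_release_lock(id) frees it again. */
void critical_section_example(void) {
    int lock_home = 0;              /* core whose TAS register we borrow */
    RCCE_acquire_lock(lock_home);
    /* ... touch shared state here ... */
    RCCE_release_lock(lock_home);
}
```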


SCC: Architecture Overview
[Figure: the tile mesh, with memory controllers to private & shared off-chip main memory and a 16 KB Message Passing Buffer (MPB) per tile]


Programming Challenges in SCC
• Message passing, but…
– The MPB is too small for large data transfers
– Data replication is difficult
• No universal atomic primitives (e.g. CAS), hence no wait-free implementations of arbitrary shared objects [Herlihy91]


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Concurrent FIFO Queues
• Main idea:
– Data are stored in shared off-chip memory
– Message passing is used for communication/coordination
• Two design methodologies:
– Lock-based synchronization (2-lock Queue)
– Message passing-based synchronization (MP-Queue, MP-Acks)


2-lock Queue
• Array-based, in shared off-chip memory (SHM)
• Head/tail pointers in MPBs
• One lock for each pointer [Michael&Scott96]
• TAS-based locks on two cores
(a data-layout sketch follows below)
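A minimal sketch of how the pieces could be laid out; names and sizes are illustrative, not the actual implementation.

```c
#define QUEUE_CAPACITY 1024          /* illustrative size */

typedef struct {
    volatile int dirty;              /* set once the payload is valid */
    int          payload;
} node_t;

node_t *nodes;                       /* array of nodes in off-chip SHM */
volatile unsigned *tail;             /* tail index, mapped in an MPB   */
volatile unsigned *head;             /* head index, mapped in an MPB   */
/* Each pointer is protected by a TAS-based lock; which tile hosts the
   locks is one of the knobs evaluated later in the talk.            */
```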


2-lock Queue: “Traditional” Enqueue Algorithm
• Acquire lock
• Read & update tail pointer (MPB)
• Add data (SHM)
• Release lock
(sketched below)
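A sketch of this version, building on the layout above and on hypothetical lock_tail()/unlock_tail() helpers; note that the data write happens while the lock is held.

```c
/* Hypothetical wrappers around the TAS-based locks of the two pointers. */
void lock_tail(void);   void unlock_tail(void);
void lock_head(void);   void unlock_head(void);

/* "Traditional" enqueue: the data write is inside the critical section
   (full-queue check omitted for brevity). */
void enqueue_traditional(int value) {
    lock_tail();
    unsigned slot = *tail;                    /* read tail (MPB)   */
    *tail = (slot + 1) % QUEUE_CAPACITY;      /* update tail (MPB) */
    nodes[slot].payload = value;              /* add data (SHM)    */
    unlock_tail();
}
```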


2-lock Queue: Optimized Enqueue Algorithm
• Acquire lock
• Read & update tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set the node’s memory flag to dirty
Why the flag? No cache coherency!
(sketched below)
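Under the same assumptions as the previous sketch, the optimized version shortens the critical section: only the pointer update is under the lock, and the dirty flag, written last, is what tells dequeuers the payload is valid.

```c
/* Optimized enqueue: lock covers only the pointer update; the data write
   and the dirty flag come after the release. With no cache coherency,
   the flag is what dequeuers poll. */
void enqueue_optimized(int value) {
    lock_tail();
    unsigned slot = *tail;                    /* read & update tail (MPB) */
    *tail = (slot + 1) % QUEUE_CAPACITY;
    unlock_tail();                            /* release before writing data */

    nodes[slot].payload = value;              /* add data to node (SHM) */
    nodes[slot].dirty   = 1;                  /* mark the node valid    */
}
```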


2-lock Queue: Dequeue Algorithm
• Acquire lock
• Read & update head pointer (MPB)
• Release lock
• Check flag
• Read node data
What about progress?
(sketched below)
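Correspondingly, a dequeue sketch under the same assumptions: the lock only covers the head update, after which the caller polls the flag. That polling is where the progress question on the slide comes from.

```c
/* Dequeue: lock covers only the head update. If the enqueuer that claimed
   this slot stalls before setting the flag, this dequeuer stalls with it:
   the progress question raised on the slide (empty-queue check omitted). */
int dequeue(void) {
    lock_head();
    unsigned slot = *head;                    /* read & update head (MPB) */
    *head = (slot + 1) % QUEUE_CAPACITY;
    unlock_head();

    while (!nodes[slot].dirty)                /* wait for the enqueuer's flag */
        ;
    int value = nodes[slot].payload;          /* read node data (SHM) */
    nodes[slot].dirty = 0;                    /* reset the flag for reuse */
    return value;
}
```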


2-lock Queue: Implementation
[Figure: head/tail pointers in MPB memory, data nodes in off-chip shared memory]
Locks? On which tile(s)?


Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a server core that keeps the head/tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit
(a server-loop sketch follows below)
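A rough sketch of the server side, reusing the queue layout above. The MPB helpers (mpb_poll_request / mpb_reply) are hypothetical stand-ins for the actual message-passing code.

```c
#define NUM_CORES 48                   /* the SCC has 48 cores */

enum { REQ_NONE, REQ_ENQ, REQ_DEQ };

/* Hypothetical MPB helpers: poll a core's request slot, and write a
   reply (a slot index) back into that core's MPB slot. */
int  mpb_poll_request(int core, int *req);
void mpb_reply(int core, unsigned slot);

void server_loop(void) {
    unsigned tail = 0, head = 0;
    for (;;) {
        for (int core = 0; core < NUM_CORES; core++) {
            int req;
            if (!mpb_poll_request(core, &req)) continue;
            if (req == REQ_ENQ) {
                mpb_reply(core, tail);            /* grant a slot at the tail */
                tail = (tail + 1) % QUEUE_CAPACITY;
            } else if (req == REQ_DEQ) {
                mpb_reply(core, head);            /* hand out the head slot; the
                                                     dequeuer then spins on that
                                                     node's dirty flag */
                head = (head + 1) % QUEUE_CAPACITY;
            }
        }
    }
}
```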


MP-Queue
[Figure: an enqueuer sends ENQ and receives the TAIL slot, then adds its data; a dequeuer sends DEQ, receives the HEAD slot, and spins on that node's flag before reading it]
What if an enqueue fails and the node is never flagged? This is “pairwise blocking”: only the one dequeue waiting on that node blocks.


Adding Acknowledgements
• No more flags! The enqueuer sends an ACK when done
• Server maintains a private queue of pointers in SHM
• On ACK: server adds the data location to its private queue
• On dequeue: server returns only ACKed locations
(sketched below)
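A sketch of how the server's side could change with acknowledgements, reusing the hypothetical helpers from the previous sketch: ACKed locations go into a private queue in SHM, and dequeues are served only from it.

```c
/* Private queue of ACKed node locations, kept by the server in SHM. */
unsigned acked[QUEUE_CAPACITY];
unsigned acked_head = 0, acked_tail = 0;

void on_ack(unsigned slot) {
    acked[acked_tail] = slot;                      /* location is now safe to hand out */
    acked_tail = (acked_tail + 1) % QUEUE_CAPACITY;
}

int on_dequeue_request(int core) {
    if (acked_head == acked_tail)
        return 0;                                  /* nothing ACKed yet: report empty */
    mpb_reply(core, acked[acked_head]);            /* return only ACKed locations */
    acked_head = (acked_head + 1) % QUEUE_CAPACITY;
    return 1;
}
```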


MP-Acks
[Figure: an enqueuer sends ENQ, receives the TAIL slot, writes its data, and then sends an ACK; a dequeuer sends DEQ and receives an already-ACKed HEAD location]
No blocking between enqueues and dequeues.


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Evaluation

Benchmark:
• Each core performs Enq/Deq operations at random
• High/low contention

Questions:
• Performance? Scalability?
• Is it the same for all cores?
(a sketch of the per-core loop follows below)
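The per-core benchmark loop might look roughly like this, reusing the 2-lock queue sketch from earlier (the real benchmark drives all three queue variants, runs for a fixed wall-clock time, and its contention knob may differ; those details are assumptions here).

```c
#include <stdlib.h>

/* Each core flips a coin between enqueue and dequeue; under low contention
   it also spends some time on dummy local work between operations. */
long run_benchmark(int low_contention, long num_ops) {
    long completed = 0;
    for (long i = 0; i < num_ops; i++) {
        if (rand() % 2) enqueue_optimized(rand());
        else            dequeue();
        completed++;
        if (low_contention)
            for (volatile int w = 0; w < 1000; w++) ;  /* local work */
    }
    return completed;                                  /* per-core op count */
}
```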


Measures [Cederman et al 2013]
• Throughput: data structure operations completed per time unit
• Fairness: for each core i, the operations completed by core i relative to the average operations per core
(one way to compute these is sketched below)
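For concreteness, one way to compute the two measures from per-core operation counts. This follows the ratio shown on the slide; the exact fairness definition used in [Cederman et al 2013] may differ, so treat the second function as an assumption.

```c
/* Throughput: total operations completed per time unit. */
double throughput(const long ops[], int num_cores, double seconds) {
    long total = 0;
    for (int i = 0; i < num_cores; i++) total += ops[i];
    return total / seconds;
}

/* Fairness for core i: its operation count relative to the average
   operations per core (1.0 means exactly its fair share). */
double fairness(const long ops[], int num_cores, int i) {
    long total = 0;
    for (int j = 0; j < num_cores; j++) total += ops[j];
    double avg = (double)total / num_cores;
    return ops[i] / avg;
}
```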


Throughput – High Contention


Fairness – High Contention


Throughput VS Lock Location


Throughput VS Lock Location


Conclusion
• Lock-based queue
– High throughput
– Less fair
– Sensitive to lock locations and NoC performance
• MP-based queues
– Lower throughput
– Fairer
– Better liveness properties
– Promising scalability


Thank you!

[email protected]@chalmers.se


BACKUP SLIDES


Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs


Concurrent FIFO Queues
• Typical 2-lock queue [Michael&Scott96]

