Introducing Parallel Pixie Dust: Advanced Library-Based Support for Parallelism in C++

BCS-DSG April 2011

Jason McGuiness, Colin Egan
[email protected] [email protected]
http://libjmmcg.sf.net/
April 7, 2011

Upload: jason-hearne-mcguiness

Post on 28-Nov-2014


DESCRIPTION

Presentation given to the BCS DSG workshop in Kingston. http://www.bcs.org/content/conWebDoc/38872

TRANSCRIPT

Page 1: Introducing Parallel Pixie Dust: Advanced Library-Based Support for Parallelism in C++

Introducing Parallel Pixie Dust: Advanced Library-Based Support for Parallelism in C++

BCS-DSG April 2011

Jason McGuiness, Colin Egan

[email protected] [email protected]

http://libjmmcg.sf.net/

April 7, 2011

Page 2.

Sequence of Presentation.

- An introduction: why I am here.
- Why do we need multiple threads?
- Multiple processors, multiple cores...
- How do we manage all these threads: libraries, languages and compilers.
- My problem statement, an experiment if you will...
- An outline of Parallel Pixie Dust (PPD).
- Important theorems.
- Examples, and their motivation.
- Conclusion.
- Future directions.

Page 3.

Introduction.

Why yet another thread-library presentation?

- Because we still find it hard to write multi-threaded programs correctly.
  - According to programming folklore.
- We haven't successfully replaced the von Neumann architecture:
  - Stored-program architectures are still prevalent.
- The memory wall still affects us:
  - The CPU-instruction retirement rate, i.e. the rate at which programs require and generate data, exceeds the memory bandwidth - a by-product of Moore's Law.
- Modern architectures add extra cores to CPUs, and in this instance, extra memory buses which feed into those cores.

Page 4.

How many Threads?

The future appears to be laid out: inexorably more execution units.

- For example, quad-core processors are becoming more popular.
- 4-way blade frames with quad cores provide 16 execution units in close proximity.
- picoChip (from Bath, U.K.) has hundreds of execution units per chip, with up to 16 in a network.
- Future supercomputers will have many millions of execution units, e.g. IBM BlueGene/C Cyclops64.

Page 5.

A Quick Review of Some Threading Models:

- Restrict ourselves to library-based techniques.
- Raw library calls.
- The "thread as an active class".
- Co-routines.
- The "thread pool" containing those objects.
  - Producer-consumer models are a sub-set of this model.
- The issue: classes used to implement business logic also implement the operations to be run on the threads. This often means that the locking is intimately entangled with those objects, and possibly even with the threading logic.
- This makes it much harder to debug these applications. How to implement multi-threaded debuggers correctly is still an open question.

Page 6.

What Is Parallel Pixie Dust (PPD)?

I decided to set myself a challenge, an experiment if you will, which may be summarized in this problem statement:

Is it possible to create a cross-platform thread library that can guarantee a schedule that is deadlock & race-condition free, is efficient, and also assists in debugging the business logic, in a general-purpose, imperative language that has little or no native threading support? Moreover, the library should offer considerable flexibility, e.g. reflect the various types of thread-related costs and techniques.

Page 7.

Part 1: Design Features of PPD.

The main design features of PPD are:

- It targets general-purpose threading using a data-flow model of parallelism:
  - This type of scheduling may be viewed as dynamic scheduling (run-time), as opposed to static scheduling (potentially compile-time), where the operations are statically assigned to execution pipelines, with relatively fixed timing constraints.
- A DSEL implemented as futures and thread pools (of many different types, using traits).
- Can be used to implement a tree-like thread schedule.
- A "thread-as-an-active-class" exists.

Page 8.

Part 2: Design Features of PPD.

Extensions to the basic DSEL have been added:

- Certain binary functors: each operand is executed in parallel.
- Adaptors for the STL collections to assist with thread-safety.
  - Combined with thread pools, these allow replacement of the STL algorithms.
- Implemented GSS(k), or baker's, scheduling:
  - May reduce synchronisation costs on thread-limited systems or pools in which the synchronisation costs are high.
- Amongst other influences, PPD was born out of discussions with Richard Harris and motivated by Kevlin Henney's presentation to ACCU'04 regarding threads.

Page 9.

More Details Regarding the Futures.

The use of futures, termed execution contexts, within PPD is crucial:

- This hides the data-related locks from the user, as the future wraps the retrieval of the data with a hidden lock, the resultant future-like object behaving like a proxy.
  - This is an implementation of the split-phase constraint.
- They are only stack-based, cannot be heap-allocated, and only const references may be taken.
  - This guarantees that there can be no aliasing of these objects. This provides extremely useful guarantees that will be used later.
- They can only be created from the returned object when work is transferred into the pool.

Page 10.

Part 1: The Thread Pools.

The primary method of controlling threads is thread pools, available in many different types:

- Work-distribution models: e.g. master-slave, work-stealing.
- Size models: e.g. fixed size, unlimited...
- Threading implementations: e.g. sequential, PThreads, Win32:
  - The sequential trait means all threading within the library can be removed, while maintaining the interface. One simple recompilation and all the bugs left are yours, i.e. in your business-model code.
  - A separation of design constraints. This also side-steps the debugger issue!

Page 11.

Part 2: The Thread Pools.

- Contains a special queue that holds the work in strict FIFO order or in user-defined priority order.
- Threading models: e.g. the cost of performing threading, i.e. heavyweight like PThreads or Win32.
- Futures further assist in managing OS threads and exceptions across the thread boundaries.
  - i.e. a PRAM programming model.

Page 12.

Exceptions.

Passing exceptions across the thread boundary is a thorny issue:

- Futures are used to receive any exceptions that may be thrown by the mutation when it is executed within the thread pool.
- If the data passed back from a thread pool is not retrieved, then the future may throw that exception from its destructor!
- The design idea expressed is that the exception is important, and if something might throw, then the execution context must be examined to verify that the mutation executed in the pool succeeded.

Page 13.

Ugly Details.

- All the traits make the objects look quite complex.
- typedefs are used so that the thread pools and execution contexts are aliased to more convenient names.
- It relies on undefined behaviour or broken compilers, for example:
  - Exceptions and multiple threads are unspecified in the current standard. Fortunately most compilers "do the right thing" and implement an exception stack per thread, but this is not standardized.
  - Currently, most optimizers in compilers are broken with regard to code hoisting. This affects the use of atomic counters, and potentially allows intrinsic locks to have the code that they are guarding hoisted around them.
  - volatile does not solve this, because volatile is too poorly specified to rely upon. It does not specify that operations on a set of objects must come after an operation on an unrelated object.

Page 14.

An Examination of SPEC2006 for STL Usage.

SPEC2006 was examined for its usage of STL algorithms, to guide the implementation of the algorithms within PPD.

[Figure: Distribution of algorithms in 483.xalancbmk. Bar chart of the number of occurrences of for_each, swap, find, find_if, copy, reverse, remove, transform, stable_sort, sort, replace, fill and count, broken down by set::count, map::count, string::find, set::find, map::find and global algorithm.]

[Figure: Distribution of algorithms in 477.dealII. Bar chart of the number of occurrences of copy, fill_n, fill, swap, sort, count, find, lower_bound, count_if, max_element, accumulate, upper_bound, min_element, unique, inner_product, remove_if, transform, nth_element, inplace_merge, for_each, find_if, equal, binary_search, adjacent_find and find, broken down in the same way.]

Page 15.

The STL-style Parallel Algorithms.

That analysis, amongst other reasons, led me to implement:

1. for_each()
2. find_if() and find()
3. count_if() and count()
4. transform() (both overloads)
5. copy()
6. accumulate() (both overloads)

Page 16.

Technical Details, Theorems.

Due to the restricted properties of the execution contexts and the thread pools, a few important results arise:

1. The thread schedule created is only an acyclic, directed graph. In fact it is a tree.
2. From this property I have proved that the schedule PPD generates is deadlock and race-condition free.
3. Moreover, the implementations of the STL-style algorithms are efficient, i.e. there are provable bounds on both the execution time and the minimum number of processors required to achieve that time.

Page 17.

The Key Theorems: Deadlock & Race-Condition Free.

1. Deadlock Free:

Theorem: If the user refrains from using any threading-related items or atomic objects other than those provided by PPD, for example those on pages 19 and 21, then they are guaranteed a schedule free of deadlocks.

2. Race-Condition Free:

Theorem: If the user refrains from using any threading-related items or atomic objects other than those provided by PPD, for example those on pages 19 and 21, and the work they wish to mutate may not be aliased by another object, then they are guaranteed a schedule free of race conditions.

Page 18.

The Key Theorems: Efficient Schedule with Futures.

Theorem: In summary, if the user only uses the threading-related items provided by the execution contexts, then the schedule of work transferred to PPD will be executed in an algorithmic order that is O(n/p + log(p)), where n is the number of tasks and p the size of the thread pool. In those cases PPD will add at most a constant order to the execution time of the schedule.

- Note that the user might implement code that orders the transfers of work in a sub-optimal manner; PPD has very limited abilities to automatically improve such schedules.
- This theorem states that once the work has been transferred, PPD will not add any further sub-efficient algorithmic delays, but might add some constant-order delay.
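To make the bound concrete, here is a worked instance with illustrative numbers (not taken from the talk):

```latex
T(n,p) = O\!\left(\frac{n}{p} + \log p\right)
\quad\Rightarrow\quad
n = 1024,\; p = 8:\;
\frac{1024}{8} + \log_2 8 = 128 + 3 = 131 \text{ steps,}
```

versus 1024 steps sequentially: near-linear speedup once n is large relative to p log p, with the log p term accounting for the tree-shaped fork/join of the schedule.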

Page 19.

Basic Example using execution_contexts.

Listing 1: General-Purpose use of a Thread Pool and Future.

    struct res_t {
        int i;
    };
    struct work_type {
        void process(res_t &) {}
    };
    pool_type pool(2);
    execution_context context(pool << joinable() << time_critical() << work_type());
    pool.erase(context);
    context->i;

- To assist in making the code appear readable, a number of typedefs have been omitted.
- The execution_context is created from adding the wrapped work to the pool.
- process(res_t &) is the only invasive artifact of the library for this use-case.
- Alternative member functions may be specified, if a more complex form is used.
- Compile-time check: the type of the transferred work is the same as that declared in the execution_context.

Page 20.

The typedefs used.

Listing 2: typedefs in the execution context example.

    typedef ppd::thread_pool<
        pool_traits::worker_threads_get_work, pool_traits::fixed_size,
        pool_adaptor<
            generic_traits::joinable, platform_api, heavyweight_threading,
            pool_traits::normal_fifo, std::less, 1
        >
    > pool_type;
    typedef pool_type::create1<work_type>::creator_t creator_t;
    typedef creator_t::execution_context execution_context;
    typedef creator_t::joinable joinable;
    typedef pool_type::priority<api_params_type::time_critical> time_critical;

- This expresses the flexibility & optimizations available:
  - pool_traits::prioritised_queue: specifies that work should be ordered by the trait std::less.
    - Could be used to order work by an estimate of the time taken to complete - very interesting, but not implemented.
  - GSS(k) scheduling implemented: the number specifies the batch size.
- execution_context checks the argument-type of process(...) at compile-time;
  - also, if const, i.e. pure, this could be used to implement memoizing - also very interesting, but not implemented.

Page 21.

Example using for_each().

Listing 3: For Each with a Thread Pool and Future.

    typedef ppd::safe_colln<
        vector<int>, lock_traits::recursive_critical_section_lock_type
    > vtr_colln_t;
    typedef pool_type::void_exec_cxt execution_context;
    struct accumulate {
        static int last; // for_each needs some kind of side-effect...
        void operator()(int t) { last += t; }
    };

    vtr_colln_t v;
    v.push_back(1); v.push_back(2);
    execution_context context(pool << joinable() << pool.for_each(v, accumulate()));
    *context;

- The safe_colln adaptor provides the lock for the collection,
  - in this case a simple EREW lock,
  - and it only locks the collection, not the elements in it.
- The for_each() returns an execution_context:
  - The input collection has a read-lock taken on it.
  - The context is released when all of the asynchronous applications of the functor have completed, i.e. when the read-lock has been dropped.
  - It makes no sense to return a random copy of the input functor.

Page 22.

Example using transform().

Listing 4: Transform with a Thread Pool and Future.

    typedef ppd::safe_colln<
        vector<long>, lock::rw<lock_traits>::decaying_write,
        no_signalling, lock::rw<lock_traits>::read
    > vtr_out_colln_t;

    vtr_colln_t v; vtr_out_colln_t v_out;
    v.push_back(1); v.push_back(2);
    pool_type::void_exec_cxt context(
        pool << joinable() << pool.transform(
            v, v_out, std::negate<vtr_colln_t::value_type>()
        )
    );
    *context;

- vtr_out_colln_t provides a CREW lock on v_out: it is re-sized, then the write-lock atomically decays to a read-lock.
- Note how the collection is locked, but not the data elements themselves.
- The transform() returns an execution_context:
  - Released when all of the asynchronous applications of the unary operation have completed & the read-locks have been dropped.
- Similarly, it makes no sense to return a random iterator.

Page 23.

Example using accumulate().

Listing 5: Accumulate with a Thread Pool and Future.

    typedef pool_type::accumulate_t<
        vtr_colln_t
    >::execution_context execution_context;

    vtr_colln_t v;
    v.push_back(1); v.push_back(2);
    execution_context context(
        pool.accumulate(v, 1, std::plus<vtr_colln_t::value_type>())
    );
    *context;

- The accumulate() returns an execution_context:
  - Released when all of the asynchronous applications of the binary operation have completed & the read-lock is dropped.
- Note the use of the accumulate_t type: it contains a specialized counter that atomically accrues the result using suitable locking according to the API and type.
- This is effectively a map-reduce operation.
- find(), find_if(), count() and count_if() are also implemented, but omitted for brevity.

Page 24.

Example using logical_or().

Listing 6: Logical Or with a Thread Pool and a Future.

    struct bool_work_type {
        void process(bool &) const {}
    };
    pool_type pool(2);
    pool_type::bool_exec_cxt context(
        pool.logical_or(bool_work_type(), bool_work_type())
    );
    *context;

- The logical_or() returns an execution_context:
  - Each argument is asynchronously computed, independently of the other:
    - the compiler attempts to verify that they are pure, with no side-effects (so they could be memoized),
    - there is no short-circuiting, so this breaks the C++ Standard.
  - Once the results are available, the execution_context is released.
- logical_and() has been equivalently implemented.

Page 25.

Example using boost::bind.

Listing 7: Boost Bind with a Thread Pool and a Future.

    struct res_t {
        int i;
    };
    struct work_type {
        res_t exec() { return res_t(); }
    };
    pool_type pool(2);
    execution_context context(
        pool << joinable() << boost::bind(&work_type::exec, work_type())
    );
    context->i;

- Boost is a very important C++ library: transferring boost::bind() is supported, returning an execution_context:
  - No unbound arguments nor placeholders are allowed in the call to boost::bind().
  - The result-type of the function call should be copy-constructible.
    - Contrast this with the reference-argument result-type, earlier.
  - The cost: stack-space for the function-pointer & bound arguments.
- Completion of the mutation releases the execution_context.

Page 26.

Some Thoughts Arising from the Examples.

- The read-locks on the collections assist the user in avoiding operations that might invalidate iterators on those collections.
- Internally, PPD creates tasks that recursively divide the collections into p equal-length sub-sequences, where p is the number of threads in the pool, taking at most O(log(min(p, n))) divisions.
- The operations on the elements should be thread-safe & must not affect the validity of iterators on those collections.
- The user may call many STL-style algorithms in succession, each one placing work into the thread pool to be operated upon, and later recall the results or note the completion of those operations via the returned execution_contexts.

Page 27.

Sequential Operation.

Recall the thread_pool typedef that contained the API and threading model:

Listing 8: Definition of the typedef for the thread pool.

    typedef ppd::thread_pool<
        pool_traits::worker_threads_get_work, pool_traits::fixed_size,
        generic_traits::joinable, platform_api, heavyweight_threading,
        pool_traits::normal_fifo, std::less, 1
    > pool_type;

To make the thread_pool, and therefore all operations on it, sequential, and remove all threading operations, one merely needs to replace the trait heavyweight_threading with the trait sequential.

- It is that simple; no other changes need be made to the client code, so all threading-related bugs have been removed.
- This makes debugging significantly easier by separating the business logic from the threading algorithms.

Page 28.

Other Features.

Other parallel algorithms implemented:

- unary_fun() and binary_fun(), which would be used to execute the operands in parallel, in a similar manner to the similarly-named STL binary functions.
  - Examples omitted for brevity.

The library is able to provide estimates of:

- The optimal number of threads required to complete the current work-load.
- The minimum time to complete the current work-load given the current number of threads.

This could provide:

- An indication of soft-real-time performance.
- Furthermore, this could be used to implement dynamic tuning of the thread pools.

Page 29.

Conclusions.

Recollecting my original problem statement, what has been achieved?

- A library that assists with creating deadlock and race-condition free schedules.
- Given the typedefs, a procedural way to automatically convert STL algorithms into efficient threaded ones.
- The ability to add to the business logic parallel algorithms that are sufficiently separated to allow relatively simple debugging and testing of the business logic independently of the parallel algorithms.
- All this in a traditionally hard-to-thread imperative language, such as C++, indicating that it is possible to re-implement such an example library in many other programming languages.

These are, in my opinion, rather useful features of the library!