implementing batcher’s bitonic sort in c++: an investigation into using library-based,...

Implementing Batcher’s Bitonic Sort in C++An Investigation into using Library-Based, Data-Parallelism

Support in C++.October 2011

Jason McGuiness, Colin Egan & Russel Winder

[email protected]@[email protected]

http://libjmmcg.sf.net/

October 5, 2011

http://libjmmcg.sf.net/

Sequence of Presentation.

I An introduction: why I am here.I How do we manage all these threads:

I libraries, languages and compilers.

I Very brief introduction of parallel sorting algorithms.I An outline of Parallel Pixie Dust (PPD).I Outline of implementation using PPD.I Conclusion.I Future directions.

Introduction.

Why yet another thread-library presentation?I Because we still find it hard to write multi-threaded programs

correctly.I According to programming folklore.

I We haven’t successfully replaced the von Neumannarchitecture:

I Stored program architectures are still prevalent.

I The memory wall still affects us:I The CPU-instruction retirement rate, i.e. rate at which

programs require and generate data, exceeds the the memorybandwidth - a by product of Moore’s Law.

I Modern architectures add extra cores to CPUs, in thisinstance, extra memory buses which feed into those cores.

A Quick Review of Some Threading Models:

I Restrict ourselves to library-based techniques.I Raw library calls.I The “thread as an active class”.I Co-routines.I The “thread pool” containing those objects.

I Producer-consumer models are a sub-set of this model.

I The issue: classes used to implement business logic alsoimplement the operations to be run on the threads. This oftenmeans that the locking is intimately entangled with thoseobjects, and possibly even the threading logic.

I This makes it much harder to debug these applications. Thequestion of how to implement multi-threaded debuggerscorrectly is still an open question.

Parallel Sorting...

Parallel sorting has a history:I The three Hungarians (Ajtai, Komlos & Szemeredi, AKS)

presented an algorithm that takes O (log (n)) time with nprocessors in 1983, but has impractically large complexityconstants.

I Cole’s parallel merge sort was presented in 1986 and takesO (log (n)) time with n processors, but has much smallercomplexity constants.

I Batcher’s bitonic sort was presented in 1968 and takesO(log2 (n)

)time, but according to Knuth when comparing it

to AKS: "Batcher’s method is much better, unless n exceedsthe total memory capacity of all computers on earth!”

I Research has indicated that Batcher’s algorithm has bettercomplexity constants than Cole’s, so we shall focus onBatcher’s algorithm.

An Examination of SPEC2006 for STL Usage.

SPEC2006 was examined for usage of STL algorithms used withinit, sorting is important to one of the benchmarks, so a potentialtarget for parallelism.

for_each

swap

find

find_if

copy

reverse

remove

transform

stable_sort

sort

replace

fill

count

0

20

40

60

80

100

Distribution of algorithms in 483.xalancbmk.

set::countmap::countstring::findset::findmap::findGlobal algorithm

Algorithm.

Nu

mb

er

of

oc

cu

ren

ce

s.

copy fill_n

fill swap

sort count

find lower_bound

count_if

max_elem

ent

accumulate

upper_bound

min_elem

ent

unique

inner_product

remove_if

transform

nth_element

inplace_merg e

for_each

find_if

equal

binary_search

adjacent_find

find

0

20

40

60

80

100

Distribution of algorithms in 477.dealII.

set::countmap::countstring::findset::findmap::findGlobal algorithm

Algorithm.

Nu

mb

er

of

oc

cu

ren

ce

s.

Part 1: Design Features of PPD.

The main design features of PPD are:I It targets general purpose threading using a data-flow model of

parallelism:I This type of scheduling may be viewed as dynamic scheduling

(run-time) as opposed to static scheduling (potentiallycompile-time), where the operations are statically assigned toexecution pipelines, with relatively fixed timing constraints.

I DSEL implemented as futures and thread pools (of manydifferent types using traits).

I Can be used to implement a tree-like thread schedule.I “thread-as-an-active-class” exists.I Gives rise to important properties: efficient, dead-lock &

race-condition free schedules.

Part 2: Design Features of PPD.

Extensions to the basic DSEL have been added:I Certain binary functors: each operand is executed in parallel.I Adaptors for the STL collections to assist with thread-safety.

I Combined with thread pools allows replacement of the STLalgorithms.

I Implemented GSS(k), or baker’s scheduling:I May reduce synchronisation costs on thread-limited systems or

pools in which the synchronisation costs are high.

I Amongst other influences, PPD was born out of discussionswith Richard Harris and motivated by Kevlin Henney’spresentation to ACCU’04 regarding threads.

The STL-style Parallel Algorithms.

That analysis, amongst other reasons lead me to implement inPPD:1. for_each()

2. find_if() & find()

3. count_if() & count()

4. transform() (both overloads)5. copy()

6. accumulate() (both overloads)7. fill_n() & fill()

8. reverse()

9. max_element() & min_element() (both overloads)10. merge() & sort() (both overloads)

Basic Example of sort().

Listing 1: Parallel sort() with a Thread Pool and Future.t y p ed e f ppd : : s a f e_co l l n <

vec to r <long >, l o c k : : r e c u r s i v e_c r i t i c a l_ s e c t i o n_ l o c k_ t y p e> vt r_co l l n_t ;

v t r_co l l n_t v ;v . push_back ( 2 ) ; v . push_back ( 1 ) ;e x e cu t i on_con t ex t c on t e x t (

pool<<j o i n a b l e ()<<poo l . s o r t <ascend ing >(v )) ;∗ con t e x t ;

I To assist in making the code appear to be readable, a numberof typedefs have been omitted.

I The safe_colln adaptor provides the lock for the collection,I in this case a simple EREW lock,I only locks the collection, not the elements in it.

I The sort() returns an execution_context:I The input collection has a read-lock taken on it.I Released when the asynchronous sort has completed, when the

read-lock has been dropped.I It makes no sense to return a random iterator.

The typedefs used.

Listing 2: typedefs in the sort() example.t y p ed e f ppd : : thread_pool<

p o o l_ t r a i t s : : worker_threads_get_work , p o o l_ t r a i t s : : f i x e d_s i z e ,pool_adaptor<

g e n e r i c _ t r a i t s : : j o i n a b l e , p la t fo rm_ap i , heavywe ight_thread ing ,p o o l_ t r a i t s : : no rma l_f i f o , s t d : : l e s s , 1

>> pool_type ;t y p ed e f poo l_type : : vo id_exec_cxt execu t i on_con t ex t ;t y p ed e f e x e cu t i on_con t ex t : : j o i n a b l e j o i n a b l e ;

I Expresses the flexibility & optimizations available:I pool_traits::prioritised_queue: specifies work should

be ordered by the trait std::less.I Could be used to order work by an estimate of time taken to

complete - very interesting, but not implemented.

I GSS(k) scheduling implemented: number specifies batch size.

Implementing Batcher’s Bitonic Sort with PPD.

Listing 3: Schematic of the sort() method.con s t i n _ i t e r a t o r midd l e ( boos t : : nex t ( beg in , ha l f_s i z e_o f_po r t i on ) ) ;e x e cu t i on_con t ex t so r t_ lh s_ascend ing (

pool<<j o i n a b l e ()<<so r t <ascend ing >(beg in , middle , compare )) ;e x e cu t i on_con t ex t so r t_rhs_descend ing (

pool<<j o i n a b l e ()<<so r t <descend ing >(middle , end , compare )) ;∗ so r t_ lh s_ascend ing ;∗ so r t_rhs_descend ing ;e x e cu t i on_con t ex t b i t on i c_merge_a l l (

poo l<<j o i n a b l e ()<<merge<d i r e c t i o n >(beg in , end , compare )) ;∗ b i t on i c_merge_a l l ;

I As the target processor may be more sophisticated than acomparator, we could use a serial sort for a minimal sub-range,based upon threading costs of the target architecture,terminating the recursion early.

I We’d avoid using quicksort, as it is O(n2)for nearly-sorted

inputs, which a bitonic sequence is.

Implementing Batcher’s Bitonic Merge with PPD.

Listing 4: Schematic of the merge() method.con s t o u t_ i t e r a t o r midd l e ( boos t : : nex t ( beg in , ha l f_s i z e_o f_po r t i on ) ) ;con s t o u t_ i t e r a t o r output ( midd l e ) ;e x e cu t i on_con t ex t s h u f f l e (

pool<<j o i n a b l e ()<<poo l . swap_ranges (beg in , middle , output , swap_pred<d i r e c i o n >(compare )

)) ;∗ s h u f f l e ;e x e cu t i on_con t ex t merge_lhs (

pool<<j o i n a b l e ()<<merge<d i r e c t i o n >(beg in , middle , compare )) ;e x e cu t i on_con t ex t merge_rhs (

pool<<j o i n a b l e ()<<merge<d i r e c t i o n >(middle , end , compare )) ;∗merge_lhs ;∗merge_rhs ;

I Again the target processor may be more sophisticated than asimple comparator, we could use a serial sort in a similarmanner.

I Although this might increase the complexity with anO (n log (n)) factor.

Some Thoughts Arising from the Examples.

I Internally PPD creates sub-tasks that recursively divide thecollections into p equal-length sub-sequences, where p is thenumber of threads in the pool, generating a tree of depthO (log (min (p, n))).

I The comparison operations on the pairs of elements should bethread-safe & must not affect the validity of iterators on thosecollections.

I The user may call many STL-style algorithms in succession,each one placing work into the thread pool to be operatedupon, and later recall the results or note the completion ofthose operations via the returned execution_contexts.

Sequential Operation.

Recall the thread_pool typedef that contained the API andthreading model:

Listing 5: Definitions of the typedef for the thread pool.t y p ed e f ppd : : thread_pool<

p o o l_ t r a i t s : : worker_threads_get_work , p o o l_ t r a i t s : : f i x e d_s i z e ,g e n e r i c _ t r a i t s : : j o i n a b l e , p la t fo rm_ap i , heavywe ight_thread ing ,p o o l_ t r a i t s : : no rma l_f i f o , s t d : l e s s , 1

> pool_type ;

To make the thread_pool and therefore all operations on itsequential, and remove all threading operations, one merely needsto replace the trait heavyweight_threading with the traitsequential.

I It is that simple, no other changes need be made to the clientcode, so all threading-related bugs have been removed.

I This makes debugging significantly easier by separating thebusiness logic from the threading algorithms.

Other Features.

Able to provide estimates of:I The optimal number of threads required to complete the

current work-load.I The minimum time to complete the current work-load given

the current number of threads.This could provide:

I An indication of soft-real-time performance.I Furthermore this could be used to implement dynamic tuning

of the thread pools.

Conclusions.

I A study of the scalability of the sort & merge operations wouldbe interesting - this is to be done.

I I believe the algorithmic complexity of the merge() andsort() operations is O

(log2

(n log(n)

p + log (p)))

.

I PPD assists with creating efficient, deadlock andrace-condition free schedules.

I The business logic and parallel algorithms are sufficientlyseparated to allow relatively simple, independent testing anddebugging of each.

I All this in a traditionally hard-to-thread imperative language,such as C++, indicating that it is possible to re-implementsuch an example library in many other programming languages.

These are, in my opinion, rather useful features of the library!

implementing batcher’s bitonic sort in c++: an investigation into using library-based,...

Documents