Outline
Introduction
Sorting Networks
Bubble Sort and its Variants
TRANSCRIPT
Introduction
Sorting is one of the most common operations performed by a computer.
Internal or external sorting
Comparison-based, Θ(n log n); non-comparison-based, Θ(n)
Background
Where are the input and output sequences stored?
- Stored on one process
- Distributed among the processes (useful as an intermediate step)
What is the order of the output sequence among the processes? Global enumeration.
How comparisons are performed
Compare-exchange is not easy in parallel sorting algorithms.
One element per process: each compare-exchange costs ts + tw in communication; since ts >> tw, performance is poor.
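As a minimal sketch (Python, not part of the original slides), a one-element-per-process compare-exchange reduces to a swap in which the lower-ranked process keeps the minimum:

```python
def compare_exchange(a_i, a_j):
    """Processes Pi and Pj exchange their single elements (cost ts + tw
    each way); Pi keeps the smaller, Pj the larger. The communication
    dominates the single comparison, hence the poor performance."""
    return min(a_i, a_j), max(a_i, a_j)
```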
How comparisons are performed (cont'd)
More than one element per process: each process holds n/p elements, with every element of Ai <= every element of Aj.
Compare-split: each step costs ts + tw*(n/p) in communication => Θ(n/p).
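The compare-split step can be sketched in Python (an illustrative sketch, not from the slides): the two processes exchange their sorted blocks of n/p elements, merge, and each keeps one half.

```python
def compare_split(a_block, b_block):
    """Compare-split between two processes, each holding a sorted block.

    After exchanging blocks, the 'low' process keeps the smaller half
    of the merged sequence and the 'high' process keeps the larger half.
    Since both blocks arrive sorted, the merge is Θ(n/p); sorted() is
    used here only for brevity."""
    assert len(a_block) == len(b_block)
    merged = sorted(a_block + b_block)
    half = len(a_block)
    return merged[:half], merged[half:]  # (kept by low, kept by high)
```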
Outline
Introduction
Sorting Networks
- Bitonic sort
- Mapping bitonic sort to hypercube and mesh
Bubble Sort and its Variants
A typical sorting network
Depth: the number of columns the network contains; the time to sort is proportional to its depth.
Bitonic sort: Θ(log² n)
Bitonic sequence <a0, a1, ..., a_(n-1)>:
- monotonically increasing, then monotonically decreasing, or
- there exists a cyclic shift of indices so that the above is satisfied. Example: 8 9 2 1 0 4 5 7
How do we rearrange a bitonic sequence into a monotonic sequence? Let s = <a0, a1, ..., a_(n-1)> be a bitonic sequence. A bitonic split divides s into s1, s2 such that:
- s1 and s2 are both bitonic
- every element of s1 is smaller than every element of s2
Bitonic split + bitonic merge => the bitonic merging network.
Sorting n unordered elements
Bitonic sort via a bitonic-sorting network. Depth: d(n) = d(n/2) + log n => d(n) = Θ(log² n).
How to map bitonic sort to a hypercube?
One element per process. How do we map the bitonic sort algorithm onto a general-purpose parallel computer?
- Process <=> wire
- The compare-exchange function is performed by a pair of processes.
Bitonic sort is communication intensive => consider the topology of the interconnection network.
- A poor mapping sends elements over long distances before they are compared, degrading performance.
Observation: communication happens between pairs of wires whose labels differ in exactly one bit.
A block of elements per process
Each process has n/p elements.
S1: Think of each process as consisting of n/p smaller processes.
- Poor parallel implementation.
S2: Replace compare-exchange with compare-split: Θ(n/p) computation + Θ(n/p) communication.
- The difference: in S2, each block is initially sorted locally.
Applies to both the hypercube and the mesh.
Performance on different architectures
Neither very efficient nor very scalable, since the underlying sequential algorithm is suboptimal.
Shellsort
Drawback of odd-even sort: a sequence with only a few elements out of order still needs Θ(n²) time to sort.
Idea: add a preprocessing phase that moves elements across long distances, thus reducing the number of odd and even phases needed afterwards.
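The preprocessing idea, moving elements long distances with large gaps before finishing with adjacent exchanges, can be sketched with sequential shellsort (Python; the gap sequence (4, 2, 1) is an arbitrary illustrative choice, not from the slides):

```python
def shellsort(a, gaps=(4, 2, 1)):
    """Gapped insertion sort: large gaps first move out-of-place
    elements across long distances, so the later small-gap passes
    (ending with gap 1, plain insertion sort) do little work."""
    a = list(a)
    for gap in gaps:
        for i in range(gap, len(a)):
            x = a[i]
            j = i
            while j >= gap and a[j - gap] > x:
                a[j] = a[j - gap]  # shift larger element right by one gap
                j -= gap
            a[j] = x
    return a
```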
Conclusion
Sorting Networks
- Bitonic network
- Mapping to hypercube and mesh
Bubble Sort and its Variants
- Odd-even sort
- Shellsort
Outline
Issues in Sorting
Sorting Networks
Bubble Sort and its Variants
Quick sort
Bucket and Sample sort
Other sorting algorithms
Quick Sort
Features: simple, low overhead; Θ(n log n) on average, Θ(n²) in the worst case.
Idea: choose a pivot (how?); partition into two parts in Θ(n); recursively solve the two sub-problems.
Complexity: T(n) = T(n-1) + Θ(n) => Θ(n²) (unbalanced); T(n) = 2T(n/2) + Θ(n) => Θ(n log n) (balanced).
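The pivot/partition/recurse idea can be sketched as (Python; picking the middle element as the pivot is an illustrative choice):

```python
def quicksort(a):
    """Choose a pivot, partition into smaller/equal/larger parts in
    Θ(n), then recursively solve the two sub-problems. Balance of the
    partition decides Θ(n log n) vs Θ(n^2)."""
    if len(a) <= 1:
        return list(a)
    pivot = a[len(a) // 2]  # pivot choice determines the balance
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)
```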
Parallelizing quicksort
Solution 1: recursive decomposition.
- Drawback: the partitioning step is handled by a single process, taking Ω(n); in the worst case the run time is Ω(n²).
Solution 2: perform the partitioning in parallel.
- We can partition an array of size n into two smaller arrays in Θ(1) time using Θ(n) processes. How?
- Formulations for the CRCW PRAM, shared-address-space, and message-passing models.
Parallel Formulation for CRCW PRAM (cost optimal)
Assumptions: n elements, n processes; write conflicts are resolved arbitrarily.
Executing quicksort can be visualized as constructing a binary tree of pivots.
Algorithm

procedure BUILD_TREE(A[1...n])
begin
    for each process i do
    begin
        root := i;
        parent_i := root;
        leftchild[i] := rightchild[i] := n + 1;
    end for
    repeat for each process i != root do
    begin
        if (A[i] < A[parent_i]) or (A[i] = A[parent_i] and i < parent_i) then
        begin
            leftchild[parent_i] := i;
            if i = leftchild[parent_i] then exit
            else parent_i := leftchild[parent_i];
        end if
        else
        begin
            rightchild[parent_i] := i;
            if i = rightchild[parent_i] then exit
            else parent_i := rightchild[parent_i];
        end else
    end repeat
end BUILD_TREE
Assuming a balanced tree: at each level, the partition is distributed to all processes in Θ(1); with Θ(log n) levels, the expected run time is Θ(log n) * Θ(1) = Θ(log n).
Parallel Formulation for Shared-Address-Space Architecture
Assumptions: n elements, p processes, shared memory.
How to parallelize? Idea of the algorithm:
- Each process is assigned a block of n/p elements.
- Select a pivot element and broadcast it.
- Local rearrangement.
- Global rearrangement => a smaller block S and a larger block L; redistribute the blocks to processes.
- How many iterations? Repeat until the array is broken into p parts.
Analysis
Assumption: pivot selection results in balanced partitions.
Over log p steps:
- Broadcasting the pivot: Θ(log p)
- Local rearrangement: Θ(n/p)
- Prefix sum: Θ(log p)
- Global rearrangement: Θ(n/p)
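The local + global rearrangement step can be sketched sequentially (Python; `global_rearrange` and its block layout are illustrative names, not from the slides). Each process splits its block around the pivot locally; prefix sums over the per-process counts give each process its write offsets for the global rearrangement.

```python
from itertools import accumulate

def global_rearrange(blocks, pivot):
    """Local rearrangement: split each process's block around the pivot.
    Global rearrangement: prefix sums over the 'smaller' and 'larger'
    counts (Θ(log p) in parallel) give each process disjoint write
    offsets into the shared array; all copies then proceed in parallel."""
    smaller = [[x for x in b if x <= pivot] for b in blocks]
    larger = [[x for x in b if x > pivot] for b in blocks]
    s_counts = [len(s) for s in smaller]
    l_counts = [len(l) for l in larger]
    s_offsets = [0] + list(accumulate(s_counts))[:-1]      # prefix sum
    total_s = sum(s_counts)                                # |S| = S/L boundary
    l_offsets = [total_s] + [total_s + c
                             for c in list(accumulate(l_counts))[:-1]]
    out = [None] * sum(len(b) for b in blocks)
    for p, (s, l) in enumerate(zip(smaller, larger)):
        out[s_offsets[p]:s_offsets[p] + len(s)] = s
        out[l_offsets[p]:l_offsets[p] + len(l)] = l
    return out, total_s
```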
Parallel Formulation for Message-Passing Architecture
Similar to the shared-address-space formulation. Difference: the array is distributed among the p processes.
Pivot selection
Random selection
- Drawback: a bad pivot leads to significant performance degradation.
Median selection
- Assumption: the initial distribution of elements in each process is uniform.
Outline
Issues in Sorting Sorting Networks Bubble Sort and its Variants
Quick sort Bucket and Sample sort Other sorting algorithms
Bucket Sort
Assumption: n elements distributed uniformly over the interval [a, b].
Idea: divide [a, b] into m equal-sized subintervals; place each element into its subinterval; sort each subinterval.
Sequential time: Θ(n log(n/m)); with m = Θ(n) buckets this is Θ(n). Compare with quicksort.
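A sequential Python sketch of the idea (illustrative; `lo`/`hi` stand for the interval endpoints a and b):

```python
def bucket_sort(a, lo, hi, m):
    """Divide [lo, hi) into m equal-sized subintervals, scatter each
    element into its bucket, then sort each bucket. Under the uniform
    assumption each bucket holds about n/m elements."""
    buckets = [[] for _ in range(m)]
    width = (hi - lo) / m
    for x in a:
        idx = min(int((x - lo) / width), m - 1)  # clamp x == hi onto the last bucket
        buckets[idx].append(x)
    out = []
    for b in buckets:
        out.extend(sorted(b))  # concatenating sorted buckets yields the sorted array
    return out
```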
Parallelization on a message-passing architecture
n elements, p processes => p buckets.
Preliminary idea: distribute n/p elements to each process; compute the subintervals and redistribute the elements; sort locally.
- Drawback: the uniform-distribution assumption is not realistic => performance degradation.
Solution: sample sort => choose splitters that guarantee each bucket receives fewer than 2n/m elements.
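The splitter-selection step of sample sort can be sketched as follows (Python; an illustrative sketch assuming one block per process and p = len(blocks), with evenly spaced samples, not the slides' exact formulation):

```python
def select_splitters(blocks, p):
    """Each of the p processes sorts its block and contributes p - 1
    evenly spaced local samples; the p(p - 1) combined samples are
    sorted, and every (p - 1)-th one becomes a global splitter.
    The resulting p - 1 splitters bound the bucket sizes."""
    samples = []
    for b in blocks:
        b = sorted(b)
        n_p = len(b)
        samples += [b[(i * n_p) // p] for i in range(1, p)]
    samples.sort()
    return samples[p - 1 :: p - 1]  # p - 1 global splitters
```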
Analysis
- Distributing n/p elements to each process
- Local sort & sample selection: Θ(p)
- Sample combining: Θ(p²); sorting the combined sample: Θ(p² log p)
- Global splitter selection: Θ(p)
- Partitioning elements: Θ(p log(n/p))
- Redistribution: O(n) + O(p log p)
- Local sort
Outline
Issues in Sorting Sorting Networks Bubble Sort and its Variants
Quick sort Bucket and Sample sort Other sorting algorithms
Enumeration Sort
Assumptions: O(n²) processes, n elements, CRCW PRAM.
Feature: based on computing the rank of each element.
Runs in Θ(1) time.
Algorithm

procedure ENUM_SORT(n)
begin
    for each process P1,j do
        C[j] := 0;
    for each process Pi,j do
        if (A[i] < A[j]) or (A[i] = A[j] and i < j) then
            C[j] := 1;
        else
            C[j] := 0;
    for each process P1,j do
        A[C[j]] := A[j];
end ENUM_SORT

Shared data structures: arrays A[n] and C[n].
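A sequential simulation of the algorithm above (Python, illustrative): the n² processes Pi,j run their comparisons concurrently on the PRAM; here the concurrent writes into C[j] are modeled by summing the comparison outcomes, which yields the rank of A[j].

```python
def enum_sort(a):
    """Sequential simulation of ENUM_SORT: process Pi,j contributes 1
    to C[j] whenever A[i] precedes A[j] (ties broken by index), so C[j]
    is the rank of A[j]; each element is then placed at its rank."""
    n = len(a)
    c = [0] * n
    for j in range(n):
        for i in range(n):
            if a[i] < a[j] or (a[i] == a[j] and i < j):
                c[j] += 1
    out = [None] * n
    for j in range(n):
        out[c[j]] = a[j]  # A[C[j]] := A[j]
    return out
```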
Radix Sort
Assumptions: n elements, n processes.
Feature: based on the binary representation of the elements; leverages the enumeration (ranking) idea via prefix sums.
Algorithm

procedure RADIX_SORT(A, r)
begin
    for i := 0 to b/r - 1 do
    begin
        offset := 0;
        for j := 0 to 2^r - 1 do
        begin
            flag := 0;
            if the ith least significant r-bit block of A[Pk] = j then
                flag := 1;
            index := prefix_sum(flag);              // Θ(log n)
            if flag = 1 then
                rank := offset + index;
            offset := offset + parallel_sum(flag);  // Θ(log n)
        endfor
        each process Pk sends its element A[Pk] to process P_rank;  // Θ(n)
    endfor
end RADIX_SORT
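A sequential Python sketch of the same digit-by-digit idea (illustrative: the prefix_sum/parallel_sum ranking collapses to stable bucket routing, and b = 16, r = 4 are assumed key and digit widths for non-negative integer keys):

```python
def radix_sort(a, b=16, r=4):
    """b-bit keys processed r bits at a time, least significant digit
    first. Each pass is a stable sort by the current r-bit digit:
    appending to per-digit buckets and concatenating plays the role of
    the prefix-sum/offset ranking that routes A[Pk] to process P_rank."""
    mask = (1 << r) - 1
    for i in range(b // r):
        shift = i * r
        buckets = [[] for _ in range(1 << r)]
        for x in a:
            buckets[(x >> shift) & mask].append(x)  # stable within each digit
        a = [x for bucket in buckets for x in bucket]
    return a
```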