Parallel Sorting
James Demmel
demmel/cs267_spr09
04/27/2009 CS267 Lecture 24 1
Some Sorting Algorithms
• Choice of algorithm depends on
  • Is the data all in memory, or on disk/tape?
  • Do we compare operands, or just use their values?
  • Sequential or parallel? Shared or distributed memory? SIMD or not?
• We will consider all data in memory and parallel:
  • Bitonic Sort
    • Naturally parallel, suitable for SIMD architectures
  • Sample Sort
    • Good generalization of Quick Sort to the parallel case
  • Radix Sort
• Data measured on CM-5
  • Written in Split-C (precursor of UPC)
  • www.cs.berkeley.edu/~culler/papers/sort.ps
04/27/2009 CS267 Lecture 24 2
LogP – model to predict, understand performance
[Figure: P processor/memory modules connected by an interconnection network; volume limited to L/g messages to or from a processor]
• L (latency): latency in sending a (small) message between modules
• o (overhead): overhead felt by the processor on sending or receiving a message
• g (gap): gap between successive sends or receives (1/BW)
• P: number of processors
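• A minimal C sketch of the point-to-point cost formula used later in these slides (2o + L + g(n-1)); the struct and function names are illustrative, and the example values are the CM-5 parameters quoted on a later slide:

#include <stdio.h>

/* LogP parameters (a sketch; field names are illustrative). */
typedef struct { double L, o, g; int P; } LogP;

/* Time for one processor to send n small messages to another:
   2o + L for the first message, then one message per gap g. */
double send_n_msgs(LogP m, int n) {
    return 2 * m.o + m.L + (n - 1) * m.g;
}

int main(void) {
    LogP cm5 = {6.0, 2.2, 4.0, 32};   /* CM-5 values in µs, quoted later */
    printf("100 msgs: %.1f us\n", send_n_msgs(cm5, 100)); /* 406.4 us */
    return 0;
}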
Bottom Line on CM-5 using Split-C (Preview)
[Chart: µs/key vs. N/P from 16384 to 1048576 for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]
• Good fit between predicted (using LogP model) and measured (10%)
• No single algorithm always best
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model
04/27/2009 CS267 Lecture 24 4
04/27/2009 CS267 Lecture 24 5
Bitonic Sort (1/2)
• A bitonic sequence is one that is:
  1. Monotonically increasing and then monotonically decreasing
  2. Or can be circularly shifted to satisfy 1
• A half-cleaner takes a bitonic sequence and produces:
  1. A first half in which every element is smaller than the smallest element in the 2nd half
  2. Two halves that are both bitonic
• Apply recursively to each half to complete the sort
• Where do we get a bitonic sequence to start with?
[Figure: a half-cleaner maps the bitonic sequence 00111000 to 00001011 (first half all smaller, both halves bitonic); recursing yields the sorted sequence 00000111]
04/27/2009 CS267 Lecture 24 6
Bitonic Sort (2/2)
• A bitonic sequence is one that is:
  1. Monotonically increasing and then monotonically decreasing
  2. Or can be circularly shifted to satisfy 1
• Any sequence of length 2 is bitonic
  • So we can sort it using the bitonic sorter on the last slide
• Sort all length-2 sequences in alternating increasing/decreasing order
• Two such length-2 sequences form a length-4 bitonic sequence
• Recursively:
  • Sort length-2^k sequences in alternating increasing/decreasing order
  • Two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted)
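• A minimal sequential C sketch of this recursion (sort the halves in opposite directions, then apply the half-cleaner recursively); the parallel versions later in these slides distribute these compare-exchange stages. Assumes n is a power of two:

#include <stdio.h>

/* Half-cleaner: compare-exchange element i with element i + n/2 so the
   smaller lands in the half selected by `up`; both halves stay bitonic. */
static void bitonic_merge(int *a, int n, int up) {
    if (n < 2) return;
    int h = n / 2;
    for (int i = 0; i < h; i++)
        if ((a[i] > a[i + h]) == up) {
            int t = a[i]; a[i] = a[i + h]; a[i + h] = t;
        }
    bitonic_merge(a, h, up);           /* recurse on each half */
    bitonic_merge(a + h, h, up);
}

/* Sort halves in alternating directions to build a bitonic sequence, then merge. */
static void bitonic_sort(int *a, int n, int up) {
    if (n < 2) return;
    bitonic_sort(a, n / 2, 1);         /* increasing first half  */
    bitonic_sort(a + n / 2, n / 2, 0); /* decreasing second half */
    bitonic_merge(a, n, up);
}

int main(void) {
    int a[8] = {5, 1, 7, 3, 8, 2, 6, 4};
    bitonic_sort(a, 8, 1);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]); /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}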
Bitonic Sort for n=16 – all dependencies shown
Block layout:
• The first lg(N/P) stages are a local sort – use the best local sort
• The remaining stages involve block-to-cyclic remap, local merges (i – lg(N/P) columns), then cyclic-to-block remap, local merges (lg(N/P) columns within a stage)
• Similar pattern to FFT: similar optimizations possible
Bitonic Sort: time per key
[Charts: µs/key vs. N/P from 16384 to 1048576, predicted using the LogP model vs. measured, for P = 32, 64, 128, 256, and 512 processors]
04/27/2009 CS267 Lecture 24 8
Sample Sort
1. Compute P-1 values of keys that would split the input into roughly equal pieces.
   – take S ~ 64 samples per processor
   – sort the P·S sample keys (on processor 0)
   – let sample keys S, 2·S, ..., (P-1)·S be the “Splitters”
   – broadcast the Splitters
2. Distribute keys based on the Splitters:
   all keys with Splitter(i-1) < key <= Splitter(i) are sent to proc i
3. Local sort of keys on each proc
[4.] possibly reshift, so each proc has N/P keys
If the samples represent the total population, then the Splitters should divide the population into P roughly equal pieces
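• A sequential C sketch of steps 1–2 (splitter selection and key routing); the use of rand() for sampling and the helper names are illustrative assumptions, not the Split-C implementation:

#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Choose P-1 splitters from P*S samples drawn from keys[0..n-1]. */
void choose_splitters(const int *keys, int n, int P, int S, int *splitters) {
    int total = P * S;
    int *samples = malloc(total * sizeof(int));
    for (int i = 0; i < total; i++)              /* S random samples per "proc" */
        samples[i] = keys[rand() % n];
    qsort(samples, total, sizeof(int), cmp_int); /* sort the P*S samples */
    for (int i = 1; i < P; i++)                  /* every S-th key is a splitter */
        splitters[i - 1] = samples[i * S];
    free(samples);
}

/* Destination processor for a key: first i with key <= splitters[i],
   else the last processor (a binary search in practice). */
int dest_proc(int key, const int *splitters, int P) {
    for (int i = 0; i < P - 1; i++)
        if (key <= splitters[i]) return i;
    return P - 1;
}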
04/27/2009 CS267 Lecture 24 9
Sample Sort: Times
[Charts: µs/key vs. N/P from 16384 to 1048576, predicted vs. measured, for P = 32, 64, 128, 256, and 512 processors]
04/27/2009 CS267 Lecture 24 10
Sample Sort Timing Breakdown
[Chart: µs/key vs. N/P from 16384 to 1048576, broken down into Split, Sort, and Dist phases; predicted and measured (-m) times]
04/27/2009 CS267 Lecture 24 11
04/19/23 CS267 Lecture 24 12
Sequential Radix Sort: Counting Sort
• Idea: build a histogram of the keys and compute the position in the answer array for each element
  A = [3, 5, 4, 1, 3, 4, 1, 4]
• Make temp array B, and write each value into its position
  B = [1, 1, 3, 3, 4, 4, 4, 5]
• Cost = O(#keys + size of histogram)
• What if the histogram is too large (e.g., all 32-bit ints? all words?)
[Figure: histogram of the key values 1–5 in A, and the scan giving each key’s start position in B]
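• A minimal C sketch of counting sort on the example above (histogram, scan for start positions, then stable placement); the 16-bin histogram size is an assumption of this example:

#include <stdio.h>

void counting_sort(const int *A, int n, int *B, int max_key) {
    int count[16] = {0};                   /* histogram; assumes max_key < 16 */
    for (int i = 0; i < n; i++) count[A[i]]++;       /* 1. histogram the keys */
    int pos = 0;                                     /* 2. prefix sum ->      */
    for (int k = 0; k <= max_key; k++) {             /*    start positions    */
        int c = count[k];
        count[k] = pos;
        pos += c;
    }
    for (int i = 0; i < n; i++)                      /* 3. place each key     */
        B[count[A[i]]++] = A[i];
}

int main(void) {
    int A[] = {3, 5, 4, 1, 3, 4, 1, 4}, B[8];
    counting_sort(A, 8, B, 5);
    for (int i = 0; i < 8; i++) printf("%d ", B[i]); /* 1 1 3 3 4 4 4 5 */
    printf("\n");
    return 0;
}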
04/19/23 CS267 Lecture 24 13
Radix Sort: Separate Key Into Parts
• Divide keys into parts, e.g., by digits (radix)
• Use counting sort on each part, starting with the least significant:
start    sort on 3rd    sort on 2nd    sort on 1st
         character      character      character
sat      run            sat            pin
saw      pin            saw            run
tip      tip            pin            sat
run      sat            tip            saw
pin      saw            run            tip
• Cost = O(#keys * #characters)
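• A C sketch of this least-significant-first structure on the 3-letter-word example; the stable counting pass per character position is the key property (the fixed sizes N and W are assumptions of the example):

#include <stdio.h>
#include <string.h>

#define N 5
#define W 3  /* word length */

static void counting_pass(char words[N][W + 1], int pos) {
    char out[N][W + 1];
    int count[256] = {0};
    for (int i = 0; i < N; i++) count[(unsigned char)words[i][pos]]++;
    int start = 0;
    for (int c = 0; c < 256; c++) { int t = count[c]; count[c] = start; start += t; }
    for (int i = 0; i < N; i++)    /* stable: equal keys keep their order */
        strcpy(out[count[(unsigned char)words[i][pos]]++], words[i]);
    memcpy(words, out, sizeof out);
}

int main(void) {
    char words[N][W + 1] = {"sat", "saw", "tip", "run", "pin"};
    for (int pos = W - 1; pos >= 0; pos--)   /* least-significant first */
        counting_pass(words, pos);
    for (int i = 0; i < N; i++) printf("%s\n", words[i]); /* pin run sat saw tip */
    return 0;
}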
Histo-radix sort
P processors, n = N/P keys each. Per pass:
1. compute local histogram
   – r-bit keys, 2^r bins
2. compute position of the 1st member of each bucket in the global array
   – 2^r scans with end-around
3. distribute all the keys
Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers
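• A sequential C sketch of one such pass on r-bit digits of 32-bit keys (r = 8 here for illustration); in the parallel version, step 2 becomes a scan across per-processor histograms rather than a local prefix sum:

#include <stdio.h>

#define R 8
#define BINS (1 << R)

void radix_pass(const unsigned *in, unsigned *out, int n, int shift) {
    int hist[BINS] = {0};
    for (int i = 0; i < n; i++)                     /* 1. histogram of digit   */
        hist[(in[i] >> shift) & (BINS - 1)]++;
    int pos = 0;
    for (int b = 0; b < BINS; b++) {                /* 2. scan -> bucket starts */
        int c = hist[b]; hist[b] = pos; pos += c;
    }
    for (int i = 0; i < n; i++)                     /* 3. distribute the keys  */
        out[hist[(in[i] >> shift) & (BINS - 1)]++] = in[i];
}

int main(void) {
    unsigned a[6] = {0xC0FFEE, 42, 7, 0xBEEF, 99, 7}, b[6];
    for (int s = 0; s < 32; s += R) {               /* 4 passes of 8 bits      */
        radix_pass(a, b, 6, s);
        for (int i = 0; i < 6; i++) a[i] = b[i];
    }
    for (int i = 0; i < 6; i++) printf("%u ", a[i]);
    printf("\n");
    return 0;
}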
04/19/23 CS267 Lecture 24 14
Histo-Radix Sort (again)
[Figure: P processors, each with local data and a local histogram]
Each pass:
• form local histograms
• form the global histogram
• globally distribute the data
04/19/23 CS267 Lecture 24 15
Radix Sort: Times
[Charts: µs/key vs. N/P from 16384 to 1048576, predicted vs. measured, for P = 32, 64, 128, 256, and 512 processors]
04/19/23 CS267 Lecture 24 16
Radix Sort: Timing Breakdown
[Chart: µs/key vs. N/P from 16384 to 1048576, broken down into LocalHist, GlobalHist, and Dist phases; predicted and measured (-m) times]
04/19/23 CS267 Lecture 24 17
Local Sort Performance on CM-5
[Chart: µs/key vs. log N/P for key-value entropies of 31, 25.1, 16.9, 10.4, and 6.2 bits; times rise with TLB misses. 11-bit radix sort of 32-bit numbers]
04/19/23 CS267 Lecture 24 18
Entropy = -Σ_i p_i log2 p_i ,
  where p_i = probability of key i
Ranges from 0 to log2(#different keys)
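• A small C sketch of this definition (assumes the key probabilities are precomputed):

#include <math.h>
#include <stdio.h>

/* Entropy of a key distribution: -sum_i p_i log2 p_i. */
double entropy(const double *p, int nkeys) {
    double h = 0.0;
    for (int i = 0; i < nkeys; i++)
        if (p[i] > 0.0) h -= p[i] * log2(p[i]);
    return h;   /* 0 (one key) up to log2(nkeys) (uniform keys) */
}

int main(void) {
    double uniform[4] = {0.25, 0.25, 0.25, 0.25};
    double skewed[4]  = {0.97, 0.01, 0.01, 0.01};
    printf("%.2f %.2f\n", entropy(uniform, 4), entropy(skewed, 4)); /* 2.00 0.24 */
    return 0;
}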
Radix Sort: Timing dependence on Key distribution
[Chart: µs/key vs. entropy (0–31 bits, plus a cyclic key pattern), broken down into Local Hist, Global Hist, and Dist phases; slowdown due to contention in redistribution]
04/19/23 CS267 Lecture 24 19
Entropy = -Σ_i p_i log2 p_i ,
  where p_i = probability of key i
Ranges from 0 to log2(#different keys)
Bottom Line on CM-5 using Split-C
[Chart: µs/key vs. N/P from 16384 to 1048576 for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]
• Good fit between predicted (using LogP model) and measured (10%)
• No single algorithm always best
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model
04/27/2009 CS267 Lecture 24 20
Sorting Conclusions
• Distributed memory model leads to a hybrid global / local algorithm
  • Use the best local algorithm combined with a global part
• LogP model is good enough to model the global part
  • bandwidth (g) or overhead (o) matter most
  • including end-point contention
  • latency (L) only matters when bandwidth doesn’t
• Modeling local computational performance is harder
  • dominated by effects of the storage hierarchy (e.g., TLBs)
  • depends on entropy
• See http://www.cs.berkeley.edu/~culler/papers/sort.ps
• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps
  • disk-to-disk parallel sorting
04/27/2009 CS267 Lecture 24 21
Radix: Stream Broadcast Problem
[Figure: processor 0 streams n words through a chain of P-1 processors]
Time = (P-1)(2o + L + (n-1)g)? Need to slow the first processor to pipeline well.
• Processor 0 does only sends
• Others receive, then send
• Receives prioritized over sends => processor 0 needs to be delayed
04/19/23 CS267 Lecture 24 23
What’s the right communication mechanism?
• Permutation via writes
  • consistency model?
  • false sharing?
• Reads?
• Bulk transfers?
  • what do you need to change in the algorithm?
• Network scheduling?
Comparison
• Good fit between predicted and measured (10%)
• Different sorts for different sorts
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model
[Chart: µs/key vs. N/P from 16384 to 1048576 for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]
04/27/2009 CS267 Lecture 24 26
Outline
• Some Performance Laws
• Performance analysis
• Performance modeling
• Parallel Sorting: combining models with measurements
• Reading:
  • Chapter 3 of Foster’s “Designing and Building Parallel Programs” (online text)
  • http://www-unix.mcs.anl.gov/dbpp/text/node26.html
  • Abbreviated as DBPP in this lecture
  • David Bailey’s “Twelve Ways to Fool the Masses”
04/19/23 CS267, Yelick 27
Measuring Performance
• Performance criterion may vary with domain
• There may be limits on acceptable running time
  • E.g., a climate model must run 1000x faster than real time.
• Any performance improvement may be acceptable
  • E.g., faster on 4 cores than on 1.
• Throughput may be more critical than latency
  • E.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system.
• Execution rate per unit cost
  • E.g., GFlop/s, GFlop/s/$, or GFlop/s/Watt
• Parallel scalability (speedup or parallel efficiency)
• Percent relative to best possible (some kind of peak)
04/19/23 CS267, Yelick 28
Amdahl’s Law (review)
• Suppose only part of an application can be parallelized
• Amdahl’s law
  • let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  • P = number of processors
    Speedup(P) = Time(1)/Time(P)
               <= 1/(s + (1-s)/P)
               <= 1/s
• Even if the parallel part speeds up perfectly, performance is limited by the sequential part
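• A tiny C sketch of the bound; the numbers illustrate how even 1% sequential work caps speedup at 100:

#include <stdio.h>

/* Amdahl's law: speedup bound for serial fraction s on P processors. */
double amdahl(double s, int P) { return 1.0 / (s + (1.0 - s) / P); }

int main(void) {
    /* Even 1% serial work caps speedup at 100, no matter how large P is. */
    printf("s=0.01: P=1024 -> %.1f, P->inf -> %.1f\n",
           amdahl(0.01, 1024), 1.0 / 0.01);   /* 91.2 and 100.0 */
    return 0;
}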
04/19/23 CS267, Yelick 29
Amdahl’s Law (for 1024 processors)
Does this mean parallel computing is a hopeless enterprise?
Source: Gustafson, Montry, Benner
04/19/23 CS267, Yelick 30
Scaled Speedup
See: Gustafson, Montry, Benner, “Development of Parallel Methods for a 1024 Processor Hypercube”, SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, pp.609.
04/19/23 CS267, Yelick 32
Little’s Law
• Latency vs. Bandwidth
  • Latency is physics (wire length)
    • e.g., the network latency on the Earth Simulator is only about 2x the speed-of-light delay across the machine room
  • Bandwidth is cost:
    • add more cables to increase bandwidth (an over-simplification)
• Principle (Little’s Law): for a production system in steady state,
    Inventory = Throughput × Flow Time
• For parallel computing, Little’s Law gives the concurrency required to be limited by bandwidth rather than latency:
    Required concurrency = Bandwidth × Latency
  (the bandwidth-delay product)
04/19/23 CS267, Yelick 33
Little’s Law
• Example 1: a single processor
  • If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line)
  • The system must support 50/12.8 ≈ 4 outstanding cache line misses to keep things balanced (run at bandwidth speed)
  • An application must be able to prefetch 4 cache line misses in parallel (without dependencies between them)
• Example 2: 1000-processor system
  • 1 GHz clock, 100 ns memory latency, 100 words of memory in the data paths between CPU and memory.
  • Main memory bandwidth is:
      ~ 1000 x 100 words x 10^9/s = 10^14 words/sec
  • To achieve full performance, an application needs:
      ~ 10^-7 x 10^14 = 10^7-way concurrency
  • (some of that may be hidden in the instruction stream)
04/19/23 CS267, Yelick 34
In Performance Analysis: Use more Data
• Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.
• E.g., from DBPP:
  • Serial algorithm scales as N + N^2
  • Speedup of 10.8 on 12 processors with problem size N = 100
  • Case 1: T = N + N^2/P
  • Case 2: T = (N + N^2)/P + 100
  • Case 3: T = (N + N^2)/P + 0.6·P^2
  • All have speedup ~10.8 on 12 procs
• Performance graphs (N = 100, 1000) show the differences in scaling
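• A small C sketch that reproduces the point: all three models agree at P = 12 but diverge sharply at larger P (the loop bounds are arbitrary):

#include <stdio.h>

int main(void) {
    double N = 100, T1 = N + N * N;    /* serial time ~ N + N^2 */
    for (int P = 12; P <= 1000; P *= 4) {
        double t1 = N + N * N / P;                  /* case 1 */
        double t2 = (N + N * N) / P + 100;          /* case 2 */
        double t3 = (N + N * N) / P + 0.6 * P * P;  /* case 3 */
        printf("P=%4d  speedups: %6.1f %6.1f %6.1f\n",
               P, T1 / t1, T1 / t2, T1 / t3);       /* P=12: all ~10.8 */
    }
    return 0;
}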
04/19/23 CS267, Yelick 35
Example: Immersed Boundary Simulation
Joint work with Ed Givelberg, Armando Solar-Lezama
[Chart: time per timestep (secs) vs. # procs for Pow3/SP 256^3, Pow3/SP 512^3, and P4/Myr 512^2 x 256 problems]
• Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC
• How useful is this data? What are ways to make it more useful/interesting?
04/19/23 CS267, Yelick 36
Performance Analysis
[Chart: percentage time breakdown (0–100%) for 256^3 runs on 2–64 procs and 512^3 runs on 32–128 procs, across phases: move, Unpack U, Send velocities, Pack U, Copy fluid U, Inverse FFTs, SolveXformXEqns, Forward FFTs, Upwind, Exchange ghost, Unpack F, Set F = 0, Send F, Pack F, Spread F, Compute F]
04/19/23 CS267, Yelick 37
Building a Performance Model
• Based on measurements/scaling of components
• FFT time is: 5·n·log n flops / flops/sec (measured for the FFT)
• Other costs are linear in either material or fluid points
• Measure constants:
  a) # flops/point (independent of machine or problem size)
  b) flops/sec (measured per machine, per phase)
  • Time is: a × b × #points
• Communication done similarly
  • Find a formula for message size as a function of problem size
  • Check the formula using tracing of some kind
  • Use the model to predict running time: Σ (constant × size)
04/19/23 CS267, Yelick 38
A Performance Model
• 512^3 in < 1 second per timestep is not possible
  • Primarily limited by bisection bandwidth
Performance Model Validation
[Chart: time (secs, log scale 0.1–1000) vs. # procs: total time for the 256^3 and 512^3 problems, model vs. actual]
04/19/23 CS267, Yelick 40
OSKI SpMV: What We Expect
• Assume
  • Cost(SpMV) = time to read the matrix
  • 1 double-word = 2 integers
  • r, c in {1, 2, 4, 8}
  • CSR: 1 int / non-zero
  • BCSR(r x c): 1 int / (r·c non-zeros)
• As r·c increases, speedup should
  • Increase smoothly
  • Approach 1.5 (eliminating all index overhead)
Speedup(r,c) = T_CSR / T_BCSR(r,c) = 1.5 / (1 + 0.5/(r·c))
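• A small C sketch of this model; the function name is illustrative, and the constants follow from the assumptions above (1 double-word per value, 0.5 double-words per index):

#include <stdio.h>

/* Modeled speedup of BCSR(r x c) over CSR, counting double-words read
   per nonzero (time = data volume under the assumptions above). */
double bcsr_speedup(int r, int c) {
    double csr  = 1.0 + 0.5;             /* 1 double + 1 int per nonzero  */
    double bcsr = 1.0 + 0.5 / (r * c);   /* 1 int per r*c-nonzero block   */
    return csr / bcsr;                   /* -> 1.5 as r*c grows           */
}

int main(void) {
    int sizes[] = {1, 2, 4, 8};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            printf("r=%d c=%d  speedup=%.3f\n",
                   sizes[i], sizes[j], bcsr_speedup(sizes[i], sizes[j]));
    return 0;
}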
Deriving the LogP Model
• Processing
  – powerful microprocessor, large DRAM, cache => P
• Communication
  + significant latency (100’s – 1000’s of cycles) => L
  + limited bandwidth (1 – 5% of memory bw) => g
  + significant overhead (10’s – 100’s of cycles) => o (on both ends)
  – no consensus on topology
    => should not exploit structure
  + limited network capacity
  – no consensus on programming model
    => should not enforce one
LogP
[Figure: P processor/memory modules connected by an interconnection network; volume limited to L/g messages to or from a processor]
• L (latency): latency in sending a (small) message between modules
• o (overhead): overhead felt by the processor on sending or receiving a message
• g (gap): gap between successive sends or receives (1/BW)
• P: number of processors
Using the LogP Model
• Send n messages from proc to proc in time 2o + L + g(n-1)
  – each processor does o·n cycles of overhead
  – and has (g-o)(n-1) + L available compute cycles
• Send n total messages from one to many in the same time
• Send n messages from many to one in the same time
  – all but L/g processors block, so fewer available cycles, unless scheduled carefully
[Timeline diagram: overhead o at each end, gap g between sends, latency L in flight]
Use of the LogP Model (cont)
• Two processors sending n words to each other (i.e., exchange) take time 2o + L + (n-1)·max(g, 2o)
• P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts, e.g., transpose, fft, cyclic-to-block, block-to-cyclic
  • exercise: what’s wrong with the formula above?
    It assumes an optimal pattern of sends/receives, so it could underestimate the time
LogP "philosophy"
• Think about:
• – mapping of N words onto P processors
• – computation within a processor, its cost, and balance
• – communication between processors, its cost, and balance
• given a charaterization of processor and network performance
• Do not think about what happens within the network
This should be good enough!
Typical Sort
Exploits the n = N/P grouping
• Significant local computation
• Very general global communication / transformation
• Computation of the transformation
Costs of Split-C (UPC predecessor) Operations
• Read, Write: x = *G, *G = x     2(L + 2o)
• Store: *G :– x                  L + 2o
• Get: x := *G                    o at issue; ... sync() costs 2L + 2o total
  • with interval g
• Bulk store (n words):           2o + (n-1)g + L
• Exchange:                       2o + 2L + (n - L/g)·max(g, 2o)
• One to many
• Many to one
LogP model parameters
• CM-5:
  • L = 6 µs, o = 2.2 µs, g = 4 µs
  • P varies from 32 to 1024
• NOW:
  • L = 8.9 µs, o = 3.8 µs, g = 12.8 µs
  • P varies up to 100
• What is the processor performance?
  • Application-specific
  • 10s of Mflops for these machines
04/19/23 CS267, Yelick 54
LogP Parameters Today
[Chart: 4·o + 2·L and g, in µs (0–100), measured on the following platforms:]
Millennium (MPICH/M2K)
SP Loopback (MPI)
SP (MPI)
Quadrics loopback (MPI, ORNL)
Quadrics (MPI,ORNL)
Quadrics (Shmem, ORNL)
Quadrics (UPC, ORNL)
Myrinet(GM)
MVICH/Giganet
VIPL/Giganet/VIA
MVICH/M-VIA/Syskonnect
VIPL/M-VIA/SysKonnect
T3E Loopback (MPI/NERSC)
T3E (MPI, NERSC)
T3E UPC (NERSC)
Local Computation Parameters - Empirical
Parameter    Operation                           µs per key                    Sort
swap         Simulate cycle butterfly per key    0.025 lg N                    Bitonic
mergesort    Sort bitonic sequence               1.0
scatter      Move key for cyclic-to-block        0.46
gather       Move key for block-to-cyclic        0.52 if n <= 64K or P <= 64   Bitonic & Column
                                                 1.1 otherwise
local sort   Local radix sort (11 bit)           4.5 if n < 64K
                                                 9.0 - (281000/n) otherwise
merge        Merge sorted lists                  1.5                           Column
copy         Shift key                           0.5
zero         Clear histogram bin                 0.2                           Radix
hist         Produce histogram                   1.2
add          Produce scan value                  1.0
bsum         Adjust scan of bins                 2.5
address      Determine destination               4.7
compare      Compare key to splitter             0.9                           Sample
localsort8   Local radix sort of samples         5.0
Odd-Even Merge – classic parallel sort
N values to be sorted; treat as two lists of M = N/2:
  A0 A1 A2 A3 ... AM-1 and B0 B1 B2 B3 ... BM-1
Sort each list separately.
Redistribute into even and odd sublists:
  A0 A2 ... AM-2, B0 B2 ... BM-2 and A1 A3 ... AM-1, B1 B3 ... BM-1
Merge into two sorted lists:
  E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1
Pairwise swaps of Ei and Oi will put it in order.
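• A sequential C sketch of Batcher’s odd-even merge implementing these steps as compare-exchanges (assumes n is a power of two and each half is sorted); the recursive formulation is the standard one, with r the stride between compared elements:

#include <stdio.h>

static void cmp_swap(int *a, int i, int j) {
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge the two sorted halves of a[lo..lo+n-1]; initial call uses r = 1. */
static void odd_even_merge(int *a, int lo, int n, int r) {
    int m = r * 2;
    if (m < n) {
        odd_even_merge(a, lo, n, m);      /* merge the even sublists */
        odd_even_merge(a, lo + r, n, m);  /* merge the odd sublists  */
        for (int i = lo + r; i + r < lo + n; i += m)
            cmp_swap(a, i, i + r);        /* pairwise swaps of Ei, Oi */
    } else {
        cmp_swap(a, lo, lo + r);
    }
}

int main(void) {
    int a[8] = {2, 3, 4, 8, 1, 5, 6, 7};  /* two sorted lists of M = 4 */
    odd_even_merge(a, 0, 8, 1);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]); /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}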
Mapping to a Butterfly (or Hypercube)
[Figure: odd-even merge mapped to a butterfly/hypercube: two sorted sublists A0–A3 and B0–B3 enter; cross edges reverse the order of one list; pairwise swaps on the way back complete the merge, e.g., 2 3 4 8 and the reversed 7 6 5 1 become 1 2 3 4 5 6 7 8]
Bitonic Sort with N/P per node

all_bitonic(int A[PROCS]::[n])
  sort(tolocal(&A[ME][0]), n, 0)
  for (d = 1; d <= logProcs; d++)
    for (i = d-1; i >= 0; i--) {
      swap(A, T, n, pair(i));
      merge(A, T, n, mode(d, i));
    }
  sort(tolocal(&A[ME][0]), n, mask(i));

A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.
Bitonic: Breakdown
[Charts: µs/key vs. N/P from 16384 to 1048576, predicted vs. measured, broken down into Localsort, Swap, Mergesort, Remap C-B, and Remap B-C phases; P = 512, random keys]
Bitonic: Effect of Key Distributions
[Chart: µs/key vs. entropy (0–31 bits), broken down into Remap B-C, Remap C-B, Local Sort, Merge Sort, and Swap phases; P = 64, N/P = 1M]
04/19/23 CS267, Yelick 62