Parallel Programming & Stuff
Jud Leonard
February 28, 2008
SiCortex Systems
Outline
• Parallel problems
  – Simulation models
  – Imaging
  – Monte Carlo methods
  – Embarrassing parallelism
• Software issues due to parallelism
  – Communication
  – Synchronization
  – Simultaneity
  – Debugging
Limits to Scaling
• Amdahl’s Law: serial code eventually dominates
  – Seldom the limitation in practice
  – Gustafson: big problems have lots of parallelism
• Often in practice, communication dominates
  – Each node treats a smaller volume
  – Each node must communicate with more partners
  – More, smaller messages in the fabric
• Improved communication enables scaling
• Communication is key to higher performance
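The two scaling laws named above can be sketched as a small model. This is an illustrative sketch, not anything from the talk; the serial fraction `s = 0.01` is an assumed example value.

```python
# Amdahl's fixed-size speedup vs. Gustafson's scaled speedup.
# Both take p processors and a serial fraction s (an assumption here).

def amdahl_speedup(p, s):
    """Speedup on p processors when a fraction s of the work is serial."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(p, s):
    """Scaled speedup: the parallel part grows with p, so the serial
    fraction matters far less for big problems."""
    return p - s * (p - 1)

# With s = 1%, Amdahl caps speedup near 1/s = 100, while Gustafson's
# scaled view keeps growing almost linearly with p.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(p, 0.01), 1), round(gustafson_speedup(p, 0.01), 1))
```

This is the slide's point in miniature: for a fixed problem, the serial part eventually dominates; for problems that grow with the machine, it seldom does.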
Physical System Simulations
• Spatial partition of the problem
  – Works best if the compute load is evenly distributed
• Weather, climate
• Fluid dynamics
  – Complex boundary management after load balancing
• Partition criteria must balance:
  – Communication
  – Compute
  – Storage
Example: 3D Convolution

• Operate on an N³ array with M³ processors
• Result is a weighted sum of neighboring points
• Single processor
  – No communication cost
  – Compute time ≈ N³
• 3D partition
  – Communication ≈ (N/M)²
  – Compute time ≈ (N/M)³
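The surface-to-volume argument above can be written out as a toy cost model. This is a sketch under assumed constants: `c` (cost per point computed) and `g` (cost per point communicated) are invented for illustration, not measured SiCortex numbers.

```python
# Toy cost model for the 3D-partitioned convolution: each of the M^3
# nodes holds an (N/M)^3 sub-cube.  Compute scales with the sub-cube's
# volume, halo communication with its surface area.

def efficiency(N, M, c=1.0, g=10.0):
    n = N / M                 # sub-cube edge length per node
    compute = c * n ** 3      # work ~ volume of the sub-cube
    comm = g * n ** 2         # halo exchange ~ surface of the sub-cube
    return compute / (compute + comm)

# Efficiency falls as the partition gets finer (M grows for fixed N),
# because the surface/volume ratio — and hence communication — grows:
for M in (1, 10, 100):
    print(M, round(efficiency(1000, M), 3))
```

This is exactly the scaling limit the next slide plots: the communication-to-compute cost ratio, not Amdahl's serial fraction, sets the efficiency curve.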
Scalability of 3D Convolution
[Chart: “Effect of Cost Ratio on Scaling Efficiency” — elapsed time vs. number of processors (1 to 10¹⁰, log scale), with scaling efficiency from 1% to 100%, shown for 10x and 100x scaling.]
Example: Logic Simulation
• Modern chips contain many millions of gates
  – Enormous inherent parallelism in the model
• Product quality depends on test coverage
  – Economic incentive
• Perfect application for parallel simulation
  – Why has nobody done it?
• Communication costs
• Complexity of the partition problem
  – Multidimensional non-linear optimization
Example: Seismic Imaging
• Similar to radar, sonar, MRI…
• Record echoes of a distinctive signal
  – Correlate across time and space
  – Estimate remote structure from variation in echo delay at multiple sensors
• Terabytes of data
  – Need efficient algorithms
  – Every sensor is affected by the whole structure
  – How to partition for efficiency?
New Issues due to Parallelism
• Communication costs
  – My memory is more accessible than others’
  – Planning, sequencing halo exchanges
    • Bulk transfers are most efficient, but take longer
  – Subroutine syntax vs. language intrinsics
  – Coherence and synchronization explicitly managed
  – Issues of grain size
• Synchronization
  – Coordination of “loose” parallelism
    • Identification of necessary sync points
Mind Games
• Simultaneity
  – Contrary to the habitual sequential mindset
  – Access to variables is not well-ordered between parallel threads
  – Order is not repeatable
• Debugging
  – Printf?
  – Breakpoints?
  – Timestamps?
Interesting Problems - Parallelism
• Event-driven simulation
• Load balancing
• Debugging
  – Correctness
    • Dependency
    • Synchronization
  – Performance
    • Critical paths
The Kautz Digraph
• Logarithmic diameter (base 3, in our case)
  – Reach any of 972 nodes in 6 or fewer steps
• Multiple disjoint paths
  – Fault tolerance
  – Congestion avoidance
• Large bisection width
  – No choke points as the network grows
• Natural tree structure
  – Parallel broadcast & multicast
  – Parallel barriers & collectives
Alphabetic Construction
• Node names are strings of length k (the diameter)
  – Alphabet of d+1 letters (d = degree)
  – No letter repeats in adjacent positions
  – ABAC: allowed
  – ABAA: not allowed
• Network order = (d+1)·d^(k−1)
  – d+1 choices for the first letter
  – d choices for each of the remaining k−1 letters
• Connections correspond to shifts
  – ABAC, CBAC, DBAC -> BACA, BACB, BACD
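The alphabetic construction is compact enough to sketch directly. This is an illustrative helper, not SiCortex software; the function names are invented.

```python
# Kautz-graph construction by node names: length-k strings over d+1
# letters with no letter repeated in adjacent positions; each out-edge
# shifts the name left by one position and appends a new last letter.
from itertools import product

def kautz_nodes(d, k, alphabet="ABCD"):
    """All valid node names for degree d, diameter k (uses d+1 letters)."""
    letters = alphabet[: d + 1]
    return [
        "".join(name)
        for name in product(letters, repeat=k)
        if all(a != b for a, b in zip(name, name[1:]))
    ]

def successors(name, d, alphabet="ABCD"):
    """Out-neighbors: shift left, then append any letter != the new last one."""
    letters = alphabet[: d + 1]
    shifted = name[1:]
    return [shifted + c for c in letters if c != shifted[-1]]

print(len(kautz_nodes(3, 4)))  # (d+1) * d**(k-1) = 4 * 27 = 108
print(successors("ABAC", 3))   # the three shifts of ABAC
```

Enumerating names this way reproduces the order formula — for d = 3, k = 6 it yields the 972 nodes quoted on the earlier slide — and `successors("ABAC", 3)` gives BACA, BACB, BACD, matching the shift example above.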
Noteworthy

• Most paths simply shift in the destination ID
  – ABCD -> BCDB -> CDBA -> DBAD -> BADC
• Unless the tail overlaps the head
  – ABCD -> BCDA -> CDAB
• A few nodes have bidirectionally-connected neighbors
  – ABAB <-> BABA
• A “necklace” consists of nodes whose names are merely rotations of each other
  – ABCD -> BCDA -> CDAB -> DABC -> ABCD again
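The shift-in-the-destination routing rule, including the tail-overlaps-head shortcut, can be sketched as follows. This is an assumed reconstruction of the rule as stated on the slide, not the actual SiCortex router.

```python
# Route from src to dst by shifting in the digits of dst, skipping any
# leading digits of dst already supplied by an overlap between the tail
# of src and the head of dst.

def kautz_path(src, dst):
    k = len(src)
    # Longest suffix of src that is also a prefix of dst ("tail overlaps head").
    overlap = 0
    for n in range(k, 0, -1):
        if src[-n:] == dst[:n]:
            overlap = n
            break
    path = [src]
    cur = src
    for c in dst[overlap:]:        # shift in the remaining destination digits
        cur = cur[1:] + c
        path.append(cur)
    return path

print(kautz_path("ABCD", "BADC"))  # no overlap: k hops
print(kautz_path("ABCD", "CDAB"))  # "CD" overlaps: only two hops
```

Both slide examples fall out of this rule: ABCD reaches BADC in four shifts, but reaches CDAB in two because the suffix "CD" is already in place.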
Whatsa Kautz Graph?

[Diagram: the smallest degree-3 Kautz graph, nodes 0–3.]

| Diam | Order |
|------|-------|
| 1    | 4     |
| 2    | 12    |
| 3    | 36    |
| 4    | 108   |
| 5    | 324   |
| 6    | 972   |
Kautz Graph Topology

[Diagram: the degree-3 Kautz graph of diameter 2, nodes 0–11, with the same diameter/order table as the previous slide.]
Whatsa Kautz Graph?

[Diagram: the degree-3 Kautz graph of diameter 3, nodes 0–35, with the same diameter/order table as the previous slides.]
Interconnect Fabric
• Logarithmic diameter
  – Low latency
  – Low contention
  – Low switch degree
• Multiple paths
  – Fault tolerant to link, node, or module failures
  – Congestion avoidance
• Cost-effective
  – Scalable
  – Modular

[Block diagram of a node: six CPU/cache pairs sharing an L2 cache, with a memory controller driving DDR DIMMs, and a DMA engine connecting to the fabric switch and PCIe.]
DMA Engine API

• Per-process structures:
  – Command and event queues in user space
  – Buffer descriptor table (writable by kernel only)
  – Route descriptor table (writable by kernel only)
  – Heap (user readable/writable)
  – Counters (control conditional execution)
• Simple command set:
  – Send Event: immediate data for a remote event queue
  – Put Im Heap: immediate data for a remote heap
  – Send Command: nested command for remote execution
  – Put Buffer to Buffer: RDMA transfer
  – Do Command: conditionally execute a command string
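To show how counters gate conditional execution, here is a purely illustrative toy model. It is not the real SiCortex DMA API: the class, method, and field names are all invented, and only the Send Event / Do Command semantics described above are modeled.

```python
# Toy model of counter-gated command execution: a "Do Command" runs its
# nested command only once the named counter has reached a threshold,
# e.g. to fire a completion event after N transfers have finished.

class DmaEngine:
    def __init__(self):
        self.counters = {}     # stands in for the per-process counters
        self.events = []       # stands in for the user-space event queue

    def send_event(self, data):
        """Model of Send Event: deliver immediate data to the event queue."""
        self.events.append(data)

    def do_command(self, counter, threshold, command):
        """Model of Do Command: run `command` only if the counter
        has reached `threshold`; report whether it ran."""
        if self.counters.get(counter, 0) >= threshold:
            command(self)
            return True
        return False

engine = DmaEngine()
# Barrier-like pattern: an event fires only after 3 completions arrive.
for _ in range(3):
    engine.counters["done"] = engine.counters.get("done", 0) + 1
fired = engine.do_command("done", 3, lambda e: e.send_event("all done"))
print(fired, engine.events)
```

The point of the pattern is that the condition check lives in the DMA engine, so the CPU need not poll for completion before triggering the next step.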
Interesting Problems - SiCortex

• Collectives optimized for the Kautz digraph
  – Optimization for a subset
  – Primitive operations
• Partitions
  – Best subsets to choose
  – Best communication pattern within a subset
• Topology mapping
  – N-dimensional mesh
  – Tree
  – Systolic array
• Global shared memory
Brains and Beauty, too!
ICE9 Die Layout
27-node Module

[Photo: module labeled with ICE9 node chips, DDR2 DIMMs, power regulators, the backpanel connector, and the module service processor with MSP Ethernet; PCI Express module options include dual Gigabit Ethernet, Fibre Channel, 10 Gb Ethernet, and InfiniBand.]
What’s new or unique? What’s not?

What’s new or unique:
• Designed for HPC
• It’s not x86
  – Performance = low power
• Communication
  – Kautz digraph topology
  – Messaging: a 1st-class operation
  – Mesochronous cluster
• Open source everything
• Performance counters
• Reliable by design
  – ECC everywhere
  – Thousands of monitors
• Factors of 3
• Lighted gull-wing doors!

What’s not:
• Linux (Gentoo)
• Little-endian
• MIPS-64 ISA
• PathScale compiler
• GNU toolchain
• IEEE floating point
• MPI
• PCI Express I/O