![Page 1: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/1.jpg)
Featured attraction:
Computers for Doing Big Science
Bill Camp, Sandia Labs
2nd Feature: Hints on MPP computing
![Page 2: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/2.jpg)
Sandia MPPs (since 1987)
• 1987: 1024-processor nCUBE10 [512 Mflops]
• 1990--1992++: two 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
• 1988--1990: 16384-processor CM-200
• 1991: 64-processor Intel IPSC-860
• 1993--1996: ~3700-processor Intel Paragon [180 Gflops]
• 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
• 1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]
• 2003: 1280-processor IA32 Linux cluster [~7 Tflops]
• 2004: Red Storm: ~11600-processor Opteron-based MPP [>40 Tflops]
• 2005: ~1280-processor 64-bit Linux Cluster [~10 TF]
• 2006: Red Storm upgrade, ~20K nodes, 160 TF
• 2008--9: Red Widow, ~50K nodes, 1000 TF (?)
![Page 3: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/3.jpg)
Computing domains at Sandia
Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up)
Domain (# Procs axis: 1, 10^1, 10^2, 10^3, 10^4)
Red Storm: X X X
Cplant Linux Supercluster: X X X
Beowulf clusters: X X X
Desktop: X
Market labels: Volume, Mid-Range, Peak, Big Science
![Page 4: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/4.jpg)
RS Node architecture
CPUs: AMD Opteron
DRAM: 1 (or 2) Gbyte or more
ASIC: NIC + Router
Six links to other nodes in X, Y, and Z
ASIC = Application Specific Integrated Circuit, or a “custom chip”
![Page 5: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/5.jpg)
3-D Mesh topology (Z direction is a torus)
[Figure: 10,368 Compute Node Mesh, X=27, Y=16, Z=24, with Torus Interconnect in Z. The label "640 Visualization, Service & I/O Nodes" appears twice in the figure.]
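The mesh dimensions on this slide multiply out to the quoted compute node count, and the "torus in Z" note means only the Z direction wraps around. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and treating X and Y as open (non-wrapping) is an assumption read off the slide title, not a statement about the real routing hardware.

```python
# Dimensions taken from the slide; everything else is illustrative.
X, Y, Z = 27, 16, 24


def node_count(x=X, y=Y, z=Z):
    """Total compute nodes in an x-by-y-by-z mesh."""
    return x * y * z


def neighbors(coord, dims=(X, Y, Z)):
    """Return the nodes directly linked to `coord`.

    Assumption: X and Y are an open mesh (no wraparound),
    while Z wraps around, i.e. it is a torus.
    """
    x, y, z = coord
    nx, ny, nz = dims
    result = []
    for dx in (-1, 1):
        if 0 <= x + dx < nx:          # no wraparound in X
            result.append((x + dx, y, z))
    for dy in (-1, 1):
        if 0 <= y + dy < ny:          # no wraparound in Y
            result.append((x, y + dy, z))
    for dz in (-1, 1):
        result.append((x, y, (z + dz) % nz))  # wraparound in Z
    return result
```

Under these assumptions an interior node uses all six links, while a corner node in X and Y still has both of its Z links because of the wraparound.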
![Page 6: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/6.jpg)
Comparison of ASCI Red and Red Storm

| | ASCI Red | Red Storm |
|---|---|---|
| Full System Operational Time Frame | June 1997 (processor and memory upgrade in 1999) | August 2004 |
| Theoretical Peak (TF), compute partition alone | 3.15 | 41.47 |
| MP-Linpack Performance (TF) | 2.38 | >30 (estimated) |
| Architecture | Distributed Memory MIMD | Distributed Memory MIMD |
| Number of Compute Node Processors | 9,460 | 10,368 |
| Processor | Intel P II @ 333 MHz | AMD Opteron @ 2 GHz |
| Total Memory | 1.2 TB | 10.4 TB (up to 80 TB) |
| System Memory Bandwidth | 2.5 TB/s | 55 TB/s |
| Disk Storage | 12.5 TB | 240 TB |
| Parallel File System Bandwidth | 2.0 GB/s | 100.0 GB/s |
| External Network Bandwidth | 0.4 GB/s | 50 GB/s |
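The theoretical peak figures on this slide can be reproduced from the processor counts and clock rates. The sketch below is only a consistency check. The flops-per-cycle values (1 for the Pentium II, 2 for the Opteron) are assumptions not stated on the slide, and the function name is hypothetical.

```python
def theoretical_peak_tflops(processors, clock_ghz, flops_per_cycle):
    """Peak = processors x clock x flops per cycle, converted from GFlops to TFlops."""
    return processors * clock_ghz * flops_per_cycle / 1000.0


# Processor counts and clocks are from the slide; flops/cycle are assumed.
asci_red_peak = theoretical_peak_tflops(9460, 0.333, 1)
red_storm_peak = theoretical_peak_tflops(10368, 2.0, 2)
```

With those assumed per-cycle rates, both results round to the values listed in the comparison (3.15 and 41.47 TF).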
![Page 7: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/7.jpg)
Comparison of ASCI Red and Red Storm

| | ASCI Red | Red Storm |
|---|---|---|
| Interconnect Topology | 3D Mesh (x, y, z), 38 x 32 x 2 | 3D Mesh (x, y, z), 27 x 16 x 24 |
| Interconnect Performance: MPI Latency | 15 µs 1 hop, 20 µs max | 2.0 µs 1 hop, 5 µs max |
| Interconnect Performance: Bi-Directional Bandwidth | 800 MB/s | 6.4 GB/s |
| Interconnect Performance: Minimum Bi-section Bandwidth | 51.2 GB/s | 2.3 TB/s |
| Full System RAS: RAS Network | 10 Mbit Ethernet | 100 Mbit Ethernet |
| Full System RAS: RAS Processors | 1 for each 32 CPUs | 1 for each 4 CPUs |
| Operating System: Compute Nodes | Cougar | Catamount |
| Operating System: Service and I/O Nodes | TOS (OSF1 UNIX) | LINUX |
| Operating System: RAS Nodes | VX-Works | LINUX |
| Red/Black Switching | 2260 – 4940 – 2260 | 2688 – 4992 – 2688 |
| System Foot Print | ~2500 ft² | ~3000 ft² |
| Power Requirement | 850 KW | 1.7 MW |
![Page 8: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/8.jpg)
Red Storm Project
• Goal: 23 months, design to First Product Shipment!
• System software is a joint project between Cray and Sandia
  - Sandia is supplying Catamount LWK and the service node run-time system
  - Cray is responsible for Linux, NIC software interface, RAS software, file system software, and Totalview port
• Initial software development was done on a cluster of workstations with a commodity interconnect. Second stage involved an FPGA implementation of SEASTAR NIC/Router (Starfish). Final checkout is on real SEASTAR-based system
• System engineering is wrapping up!
  - Cabinets: exist
  - SEASTAR NIC/Router: RTAT back from Fabrication at IBM late last month
• Full system to be installed and turned over to Sandia in stages culminating in August--December 2004
![Page 9: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/9.jpg)
Designing for scalable scientific supercomputing
Challenges in:
- Design
- Integration
- Management
- Use
![Page 10: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/10.jpg)
Design SUREty for Very Large Parallel Computer Systems
SURE poses Computer System Requirements:
Scalability - Full System Hardware and System Software
Usability - Required Functionality Only
Reliability - Hardware and System Software
Expense minimization - use commodity, high-volume parts
![Page 11: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/11.jpg)
SURE Architectural tradeoffs:
• Processor and memory sub-system balance
• Compute vs interconnect balance
• Topology choices
• Software choices
• RAS
• Commodity vs. Custom technology
• Geometry and mechanical design
![Page 12: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/12.jpg)
Sandia Strategies:
- Build on commodity
- Leverage Open Source (e.g., Linux)
- Add to commodity selectively (in RS there is basically one truly custom part!)
- Leverage experience with previous scalable supercomputers
![Page 13: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/13.jpg)
System Scalability Driven Requirements
Overall System Scalability - Complex scientific applications such as molecular dynamics, hydrodynamics, & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).
![Page 14: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/14.jpg)
Scalability: System Software
System Software Performance scales nearly perfectly with the number of processors to the full size of the computer (~30,000 processors). This means that System Software time (overhead) remains nearly constant with the size of the system or scales at most logarithmically with the system size.
- Full re-boot time scales logarithmically with the system size.
- Job loading is logarithmic with the number of processors.
- Parallel I/O performance is not sensitive to # of PEs doing I/O.
- Communication Network software must be scalable.
  - Prefer no connection-based protocols among compute nodes.
  - Message buffer space independent of # of processors.
  - Compute node OS gets out of the way of the application.
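One way to see why job loading can be logarithmic in the number of processors is a tree-style fan-out, where every node that already holds the executable forwards it to new nodes in each round. The sketch below is a hypothetical illustration of that idea, not a description of the actual Red Storm or Cplant launcher. The function name and the `fanout` parameter are assumptions.

```python
def fanout_rounds(num_nodes, fanout=1):
    """Rounds needed for one launcher to reach `num_nodes` nodes.

    In each round, every node that already has the job forwards it
    to `fanout` nodes that do not have it yet. With fanout=1 the
    number of loaded nodes doubles every round.
    """
    reached, rounds = 1, 0
    while reached < num_nodes:
        reached += reached * fanout
        rounds += 1
    return rounds
```

With the default doubling pattern, going from 10,368 nodes to twice that many adds a single round, which is the kind of scaling the requirement asks for.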
![Page 15: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/15.jpg)
Hardware scalability
• Balance in the node hardware:
  • Memory BW must match CPU speed
    Ideally 24 Bytes/flop (never yet done)
  • Communications speed must match CPU speed
  • I/O must match CPU speeds
• Scalable System SW (OS and Libraries)
• Scalable Applications
![Page 16: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/16.jpg)
Usability > Application Code Support:
Software that supports scalability of the Computer System
- Math Libraries
- MPI Support for Full System Size
- Parallel I/O Library
- Compilers
Tools that Scale to the Full Size of the Computer System
- Debuggers
- Performance Monitors
Full-featured LINUX OS support at the user interface
![Page 17: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/17.jpg)
Reliability
• Light Weight Kernel (LWK) O.S. on compute partition
  - Much less code fails much less often
• Monitoring of correctible errors
  - Fix soft errors before they become hard
• Hot swapping of components
  - Overall system keeps running during maintenance
• Redundant power supplies & memories
• Completely independent RAS System monitors virtually every component in system
![Page 18: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/18.jpg)
Economy
1. Use high-volume parts where possible
2. Minimize power requirements
   - Cuts operating costs
   - Reduces need for new capital investment
3. Minimize system volume
   - Reduces need for large new capital facilities
4. Use standard manufacturing processes where possible-- minimize customization
5. Maximize reliability and availability/dollar
6. Maximize scalability/dollar
7. Design for integrability
![Page 19: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/19.jpg)
Economy
Red Storm leverages economies of scale
• AMD Opteron microprocessor & standard memory
• Air cooled
• Electrical interconnect based on Infiniband physical devices
• Linux operating system
Selected use of custom components
• System chip ASIC
  - Critical for communication intensive applications
• Light Weight Kernel
  - Truly custom, but we already have it (4th generation)
![Page 20: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/20.jpg)
Cplant on a slide
[Diagram: system layout. Labels: Net I/O, System Support, Service, Sys Admin, Users, File I/O, Compute, /home, other, I/O Nodes, Compute Nodes, Service Nodes, Ethernet, ATM, Operator(s), HiPPI, System, ASCI Red.]
Goal: MPP “look and feel”
• Start ~1997, upgrade ~1999--2001
• Alpha & Myrinet, mesh topology
• ~3000 procs (3Tf) in 7 systems
• Configurable to ~1700 procs
• Red/Black switching
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
![Page 21: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/21.jpg)
IA-32 Cplant on a slide
[Diagram: system layout. Labels: Net I/O, System Support, Service, Sys Admin, Users, File I/O, Compute, /home, other, I/O Nodes, Compute Nodes, Service Nodes, Ethernet, ATM, Operator(s), HiPPI, System, ASCI Red.]
Goal: Mid-range capacity
• Started 2003, upgrade annually
• Pentium-4 & Myrinet, Clos network
• 1280 procs (~7 Tf) in 3 systems
• Currently configurable to 512 procs
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
![Page 22: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/22.jpg)
Observation: For most large scientific and engineering applications the performance is more determined by parallel scalability and less by the speed of individual CPUs.
There must be balance between processor, interconnect, and I/O performance to achieve overall performance.
To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.
![Page 23: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/23.jpg)
Let’s Compare Balance In Parallel Systems
| Machine | Node Speed Rating (MFlops) | Network Link BW (Mbytes/s) | Communications Balance (Bytes/flop) |
|---|---|---|---|
| ASCI RED | 400 | 800 (533) | 2 (1.33) |
| T3E | 1200 | 1200 | 1 |
| ASCI RED** | 666 | 800 (533) | (1.2) 0.67 |
| Cplant | 1000 | 140 | 0.14 |
| Blue Mtn* | 500 | 800 | 1.6 |
| BlueMtn** | 64000 | 1200 (9600*) | 0.02 (0.16*) |
| Blue Pacific | 2650 | 300 (132) | 0.11 (0.05) |
| White | 24000 | 2000 | 0.083 |
| Q** | 10000 | 400 | 0.04 |
| Q* | 2500 | 650 | 0.2 |
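The communications balance column is the network link bandwidth divided by the node speed rating, which gives bytes moved per floating-point operation. A minimal sketch, using figures quoted on this slide (the function name is hypothetical):

```python
def comm_balance(link_bw_mbytes_per_s, node_mflops):
    """Communications balance in Bytes/flop: link bandwidth over node speed."""
    return link_bw_mbytes_per_s / node_mflops


# Figures taken from the slide.
asci_red = comm_balance(800, 400)    # 2.0
t3e = comm_balance(1200, 1200)       # 1.0
cplant = comm_balance(140, 1000)     # 0.14
white = comm_balance(2000, 24000)    # about 0.083
```

Reading the slide this way, a faster node with the same link only lowers the balance, which is the point the comparison is making.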
![Page 24: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/24.jpg)
Comparing Red Storm and BGL
| | Blue Gene Light** | Red Storm* | Red Storm relative to BGL |
|---|---|---|---|
| Node Speed | 5.6 GF | 5.6 GF | 1x |
| Node Memory | 0.25--0.5 GB | 2 (1--8) GB | 4x nom. |
| Network latency | 7 µsecs | 2 µsecs | 2/7x |
| Network link BW | 0.28 GB/s | 6.0 GB/s | 22x |
| BW Bytes/Flops | 0.05 | 1.1 | 22x |
| Bi-Section B/F | 0.0016 | 0.038 | 24x |
| #nodes/problem | 40,000 | 10,000 | 1/4x |

\* 100 TF version of Red Storm
\*\* 360 TF version of BGL
![Page 25: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/25.jpg)
Fixed problem performance
Molecular dynamics problem (LJ liquid)
![Page 26: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/26.jpg)
Scalable computing works
[Figure: ASCI Red efficiencies for major codes. Scaled parallel efficiency (%), 0 to 100, vs. Processors, 1 to 10000. Series: QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port-1B Cells, Rad x-port - 17M, Rad x-port - 80M, Rad x-port - 168M, Rad x-port - 532M, Finite Element, Zapotec, Reactive Fluid Flow, Salinas, CTH.]
![Page 27: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/27.jpg)
Basic Parallel Efficiency Model
[Figure: Parallel Efficiency (0.00 to 1.20) vs. Communication/Computation Load (0.0 to 1.0). Curves: Red Storm (B=1.5), ASCI Red (B=1.2), Ref. Machine (B=1.0), Earth Sim. (B=.4), Cplant (B=.25), Blue Gene Light (B=.05), Std. Linux Cluster (B=.04). Annotations: Peak, Linpack, Scientific & eng. codes.]
Balance is critical to scalability
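The slide does not state the formula behind these curves. As a rough illustration only, the sketch below assumes a simple model in which communication time adds to computation time and shrinks in proportion to the balance factor B, so efficiency = 1 / (1 + load / B). That functional form and the function name are assumptions, not the author's model. Only the B values come from the slide.

```python
def parallel_efficiency(comm_comp_load, balance):
    """Assumed model: efficiency = 1 / (1 + load / B).

    comm_comp_load: communication/computation load on a reference
                    machine with B = 1.0 (the slide's x-axis).
    balance:        the machine's balance factor B (slide legend).
    """
    return 1.0 / (1.0 + comm_comp_load / balance)


# B values from the slide legend.
BALANCE = {
    "Red Storm": 1.5,
    "ASCI Red": 1.2,
    "Ref. Machine": 1.0,
    "Earth Sim.": 0.4,
    "Cplant": 0.25,
    "Blue Gene Light": 0.05,
    "Std. Linux Cluster": 0.04,
}

# Efficiency of each machine at an illustrative load of 0.5.
at_half_load = {name: parallel_efficiency(0.5, b) for name, b in BALANCE.items()}
```

Under this assumed model, machines with low B lose efficiency quickly as the communication share of the workload grows, which matches the qualitative message of the slide.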
![Page 28: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/28.jpg)
Relating scalability and cost
[Figure: Efficiency ratio (Red/Cplant), 0.00 to 6.00, vs. Processors, 1 to 4096. Legend: Eff. Ratio, Extrapolation. Reference line: Efficiency ratio = Cost ratio = 1.8. Regions: MPP more cost effective; Cluster more cost effective.]
Average efficiency ratio over the five codes that consume >80% of Sandia’s cycles
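The break-even line on this slide can be read as a simple comparison: if the MPP costs 1.8 times as much as the cluster, it pays off once its efficiency advantage exceeds that same factor. A minimal sketch of that reading follows. The function name is hypothetical, and the direction of the comparison is an interpretation of the slide rather than something it spells out.

```python
COST_RATIO = 1.8  # break-even value quoted on the slide


def more_cost_effective(efficiency_ratio, cost_ratio=COST_RATIO):
    """Compare the MPP/cluster efficiency ratio with the MPP/cluster cost ratio."""
    if efficiency_ratio > cost_ratio:
        return "MPP"
    if efficiency_ratio < cost_ratio:
        return "cluster"
    return "break-even"
```

For example, an efficiency ratio of 3.0 would favor the MPP under this reading, while a ratio of 1.2 would favor the cluster.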
![Page 29: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/29.jpg)
Scalability determines cost effectiveness
Sandia’s top priority computing workload:
[Figure: Total Node-Hours of Jobs (0 to 80,000,000) vs. Number of Nodes (1 to 10000). Annotations: 380M node-hrs; 55M node-hrs; 256; MPP more cost effective; Cluster more cost effective.]
![Page 30: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/30.jpg)
Scalability also limits capability
[Figure: ITS Speedup curves. Speedup (0 to 1200) vs. Processors (0 to 1408). Series: Red Speedup, Cplant Speedup, Poly. (Red Speedup), Poly. (Cplant Speedup). Annotation: ~3x processors.]
![Page 31: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/31.jpg)
Commodity nearly everywhere-- Customization drives cost
• Earth Simulator and Cray X-1 are fully custom Vector systems with good balance
  • This drives their high cost (and their high performance).
• Clusters are nearly entirely high-volume with no truly custom parts
  • Which drives their low cost (and their low scalability)
• Red Storm uses custom parts only where they are critical to performance and reliability
  • High scalability at minimal cost/performance
![Page 32: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/32.jpg)
![Page 33: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/33.jpg)
“Honey It’s not one of those…” or
Hints on MPP Computing
[Excerpted from a talk with this title given by Bill Camp at CUG-Tours in October 1994]
![Page 34: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/34.jpg)
Issues in MPP Computing:
1. Physically shared memory does not scale
2. Data must be distributed
3. No single data layout may be optimal
4. The optimal data layout may change during the computation
5. Communications are expensive
6. The single control stream in SIMD computing makes it simple-- at the cost of a severe loss in performance due to load balancing problems
7. In data parallel computing (à la CM-5) there can be multiple control streams-- but with global synchronization
Less simple but overhead remains an issue
8. In MIMD computing there are many control streams, loosely synchronized (e.g., with messages)
Powerful, flexible and complex
![Page 35: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/35.jpg)
Why doesn’t shared memory scale?
Switch
CPU CPU CPU CPU CPU CPU CPU
cache cache cache cache cache cache cache
memory memory memory memory memory memory memory
• Bank conflicts-- about a 40% hit for large # of banks and CPUs
• Memory coherency-- who has the data, can I access it?
• High, non-deterministic latencies
![Page 36: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/36.jpg)
Amdahl’s Law
Time on a single processor:
$T_1 = T_{ser} + T_{par}$
Time on P processors:
$T_P = T_{ser} + T_{par}/P + T_{comm}$
Ignore communications:
Speedup, $S_P$ ($= T_1 / T_P$), is then
$S_P = \{ f_{ser} + [1 - f_{ser}]/P \}^{-1}$
where $f_{ser} = T_{ser} / T_1$
So, $S_P < 1 / f_{ser}$
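The bound can be checked numerically. The sketch below evaluates the speedup formula from this slide with communications ignored. The function name and the sample serial fraction of 1% are illustrative choices, not values from the talk.

```python
def amdahl_speedup(f_ser, p):
    """Speedup on p processors for serial fraction f_ser (communications ignored)."""
    return 1.0 / (f_ser + (1.0 - f_ser) / p)


# With 1% serial work, 100 processors give only about a 50x speedup,
# and no processor count can push the speedup to 1 / f_ser = 100.
speedup_100 = amdahl_speedup(0.01, 100)
speedup_million = amdahl_speedup(0.01, 10**6)
```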
![Page 37: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/37.jpg)
The Axioms
Axiom 1: Amdahl’s Law is inviolate (Sp < 1 / fser )
Axiom 2: Amdahl’s Law doesn’t matter for MPP if you know what you are doing (Comm’s dominate)
Axiom 3 : Nature is parallel
Axiom 4 : Nature is (mostly) local
Axiom 5 : Physical shared memory does not scale
Axiom 6 : Physically distributed memory does
Axiom 7 : Nevertheless, a global address space is nice to have
![Page 38: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/38.jpg)
The Axioms
Axiom 8: Like solar energy, automatic parallelism is the technology of the future
Axiom 9: successful parallelism requires the near total suppression of serialism
Axiom 10 : The best thing you can do with a processor is serial execution
Axiom 11 : Axioms 9 &10 are not contradictory
Axiom 12 : MPPs are for doing large problems fast (if you need to do a small problem fast, look elsewhere).
Axiom 13 : Generals build weapons to win the last war (so do computer scientists)
![Page 39: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/39.jpg)
The Axioms
Axiom 14 : first find coarse-grained, then medium-grained, then fine-grained parallelism
Axiom 15: done correctly, the gain from these is multiplicative
Axiom 16 : Life’s a balancing act; so’s MPP computing
Axiom 17 : Be an introvert-- never communicate needlessly
Axiom 18 : Be independent; never synchronize needlessly
Axiom 19 : Parallel computing is a cold world, bundle up well
![Page 40: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/40.jpg)
The Axioms
Axiom 20 : I/O should only be done under medical supervision
Axiom 21: If MPP computin’ is easy it ain’t cheap
Axiom 22 : If MPP computin’ is cheap, it ain’t easy
Axiom 23 : The difficulty of programming an MPP effectively is directly proportional to latency
Axiom 24 : The parallelism is in the problem, not in the code
![Page 41: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/41.jpg)
The Axioms
Axiom 25 : There are an infinite number of parallel algorithms
Axiom 26 : There are no parallel algorithms (Simon’s theorem)-- it’s almost true
Axiom 27: The best parallel algorithm is almost always a parallel implementation of the best serial algorithm (what Horst really meant)
Axiom 28 : Amdahl’s Law DOES limit vector speedup!
Axiom 18’ : Work in teams (sometimes SIMD constructs are just what the Doctor ordered)!
Axiom 29 : Do try this at home!
![Page 42: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/42.jpg)
(Some of) the Hints
Hint 1:
Any amount of serial computing is death
So… 1) make the problem large
2) Look everywhere for serialism and purge it from your code
3) Never, ever, ever add serial statements
![Page 43: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/43.jpg)
(Some of) the Hints
Hint 2:
Keep communications in the noise!
So… 1) Don’t do little problems on big computers
2) Change algorithms when profitable
3) Bundle up!-- avoid small messages on high-latency interconnects
4) Don’t waste memory-- using all the memory on a node minimizes the ratio of communications to useful work
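The "bundle up" advice follows from a simple cost model in which every message pays a fixed latency plus a size-dependent transfer time. The sketch below uses that common latency-plus-bandwidth model. The function name and the sample numbers (20 µs latency, 800 bytes/µs, 1000 values of 8 bytes each) are illustrative assumptions, not measurements from the talk.

```python
def transfer_time_us(num_messages, bytes_per_message, latency_us, bandwidth_bytes_per_us):
    """Total time to send messages one after another: each pays latency plus size/bandwidth."""
    return num_messages * (latency_us + bytes_per_message / bandwidth_bytes_per_us)


LATENCY_US = 20.0          # assumed per-message latency
BANDWIDTH = 800.0          # assumed link bandwidth in bytes per microsecond

# 1000 separate 8-byte messages versus one bundled 8000-byte message.
separate = transfer_time_us(1000, 8, LATENCY_US, BANDWIDTH)
bundled = transfer_time_us(1, 8000, LATENCY_US, BANDWIDTH)
```

In this model the separate sends are dominated almost entirely by latency, so bundling the same data into one message is several hundred times faster.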
![Page 44: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/44.jpg)
(Some of) the Hints
Hint 3:
The parallelism is in the problem!
E.G. SAR, Monte Carlo, Direct Sparse solvers, Molecular Dynamics
So,… 1) Look first at the problem
2) Look second at algorithms
3) Look at data structures in the code
4) don’t waste cycles on line-by-line parallelism
![Page 45: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/45.jpg)
(Some of) the Hints
Hint 4:
Incremental Parallelism Is Too Inefficient! Don’t fiddle with the Fortran
Look at the Problem:
-- Identify the kinds of parallelism it contains
1) Multi-program
2) Multi-task
4) data parallelism
5) inner-loop parallelism (e.g. vectors)
Time
into
effort
![Page 46: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/46.jpg)
(Some of) the Hints
Hint 5:
Often: With Explicit Message Passing (EMP) or Gets/Puts
You can re-use virtually all of your code
(changes and additions ~ few%)
-- With data parallel languages, you re-write your code
It can be easy
but
Performance is usually unacceptable
![Page 47: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/47.jpg)
(Some of) the Hints
Hint 6:
Load Balancing (Use existing libraries and technology)
-Easy in EMP!
-Hard (or impossible) in HPF, F90, CMF, …
-Only load balance if Tnew + Tbal < Told
Static or Dynamic:
Graph-based
geometry based
Particle-based
Hierarchical Master-Slave
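The rule "only load balance if Tnew + Tbal < Told" can be written as a one-line check. In this sketch the function and parameter names are hypothetical: `t_old` is the run time with the current distribution, `t_new` the expected run time after rebalancing, and `t_bal` the cost of the rebalancing step itself.

```python
def should_rebalance(t_old, t_new, t_bal):
    """Rebalance only if the new run time plus the balancing cost beats the old run time."""
    return t_new + t_bal < t_old
```

For example, `should_rebalance(100, 80, 10)` is true because 90 is less than 100, while `should_rebalance(100, 95, 10)` is false because the balancing cost outweighs the gain.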
![Page 48: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/48.jpg)
(Some of) the Hints
Hint 7:
Synchronization is expensive
So, … Don’t do it unless you have to
Never, ever put in synchronization just to get rid of a bug
else you’ll be stuck with it for the life of the code!
![Page 49: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/49.jpg)
(Some of) the Hints
Hint 8:
I/O can ruin your whole afternoon:
It is amazing how many people will create wonderfully scalable codes only to spoil them with needless or serial or non-balanced I/O
Use I/O sparingly
Stage I/O carefully
![Page 50: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/50.jpg)
(Some of) the Hints
Hint 9:
Religious prejudice is the bane of computing
Caches aren’t inherently bad
Vectors aren’t inherently good
Small SMP’s will not ruin your life
Single processor nodes are not killers
…
![Page 51: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/51.jpg)
La Fin (The End)
![Page 52: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/52.jpg)
Scaling data for some key engineering codes
[Figure: Performance on Engineering Codes. Scaled Parallel Efficiency (0.00 to 1.20) vs. Processors (1 to 1024). Series: ITS, Red; ITS, Cplant; ACME, Red; ACME, Cplant. Annotations: Random variation at small proc. counts; Large differential in efficiency at large proc. counts.]
![Page 53: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/53.jpg)
Scaling data for some key physics codes
Los Alamos’ Radiation transport code
[Figure: PARTISN Diffusion Solver Sizeup Study, S6P2, 12 Groups, 13,800 cells/PE. Parallel Efficiency (0% to 120%) vs. Number of Processor Elements (1 to 2048). Series: ASCI Red, Blue Mountain, White, QSC.]
[Figure: PARTISN Transport Solver Sizeup Study, S6P2, 12 Groups, 13,800 cells/PE. Parallel Efficiency (0% to 120%) vs. Number of Processor Elements (1 to 2048). Series: ASCI Red, Blue Mountain, White, QSC.]
![Page 54: Featured attraction: Computers for Doing Big Science Bill Camp, Sandia Labs 2nd Feature: Hints on MPP computing](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d0d5503460f949e1e16/html5/thumbnails/54.jpg)
Parallel Sn Neutronics (provided by LANL)