High Performance Computing (HPC) Introduction
Ontario Summer School on High Performance Computing
Mike Nolta, SciNet HPC Consortium
July 11, 2016
Outline
1 SciNet
2 HPC Overview
3 Parallel Computing: Amdahl's law, Beating Amdahl's law, Load Balancing, Locality
4 HPC Hardware: Distributed Memory, Shared Memory, Hybrid Architectures, Heterogeneous Architectures, Software
5 HPC Programming Models & Software
6 Serial Jobs : GNU Parallel
7 Useful Sites
Acknowledgments
Contributing Material
Intro to HPC - S. Northrup, M. Ponce, SciNet
SciNet Parallel Scientific Computing Course - L. J. Dursi & R. V. Zon, SciNet
Parallel Computing Models - D. McCaughan, SHARCNET
High Performance Computing - D. McCaughan, SHARCNET
HPC Architecture Overview - H. Merez, SHARCNET
Intro to HPC - T. Whitehead, SHARCNET
SciNet is ...
. . . a consortium for High-Performance Computing consisting of researchers at U. of T. and its associated hospitals.
One of 7 consortia in Canada that provide HPC resources to Canadian academic researchers and their collaborators.
SciNet is ...
. . . home to the 1st and 2nd most powerful research supercomputers in Canada (and a few more)
SciNet is ...
. . . where to take courses on computational topics, e.g.
Intro to SciNet, Linux Shell
Scientific Programming (Modern FORTRAN / C++)
GPGPU with CUDA
Intro to Research Computing with Python
Scientific Computing Course (for credits, for physics/astro/chem grads)
Ontario HPC Summerschool
Certificates Program: HPC, Scientific Comp. & Data Science
https://support.scinet.utoronto.ca/education/
SciNet people
. . . technical analysts who can work directly with you.
Mike Nolta
Scott Northrup
Marcelo Ponce
Erik Spence
Ramses van Zon
Daniel Gruner (CTO-software)
+ the people that make sure everything runs smoothly.
Jaime Pinto
Joseph Chen
Jason Chong
Ching-Hsing Yu
Leslie Groer
Chris Loken (CTO)
+ Technical director: Prof. Richard Peltier
+ Business manager: Teresa Henriques
+ Project coordinator: Alex Dos Santos/Jillian Dempsey
How to get an account
Any qualified researcher at a Canadian university can get a SciNet account through the following process:
1 Register for a Compute Canada Database (CCDB) account
2 Non-faculty need a sponsor (supervisor's CCRI number), who has to have a SciNet account already.
3 Login to CCDB and apply for a SciNet account (click Apply beside SciNet on the Consortium Accounts page)
4 Agree to the Acceptable Usage Policy (e.g., don't share your account, respect others, we can monitor your jobs)
Resources at SciNet
General Purpose Cluster (GPC)
3864 nodes with two 2.53GHz quad-core Intel Xeon 5500 (Nehalem) x86-64 processors
16 GB RAM per node
16 threads per node
1:1 DDR (840 nodes) and 5:1 QDR (3024 nodes) Infiniband interconnect
306 TFlops (261 HPL)
debuted at #16 on the June 2009 TOP500 supercomputer list
currently (Jul. 2016) #552
Other Compute Resources at SciNet
Tightly Coupled System (TCS)
Power 7 Linux Cluster (P7)
GPU Devel Nodes (ARC/Gravity)
Blue Gene/Q (BGQ)
Storage Resources at SciNet
Disk space: 1.4 PB of storage in 1790 drives
Two controllers, each delivering 4-5 GB/s (r/w)
Shared file system: GPFS on all systems
Storage space: HPSS, 5 TB of tape-backed storage
Scientific High Performance Computing
What is it?
HPC is essentially leveraging larger and/or multiple computers to solve computations in parallel.
What does it involve?
hardware - pipelining, instruction sets, multi-processors, inter-connects
algorithms - concurrency, efficiency, communications
software - parallel approaches, compilers, optimization, libraries
When do I need HPC?
My problem takes too long → more/faster computation
My problem is too big → more memory
My data is too big → more storage
Scientific High Performance Computing
Why is it necessary?
Modern experiments and observations yield vastly more data to be processed than in the past.
As more computing resources become available, the bar for cutting-edge simulations is raised.
Science that could not have been done before becomes tractable.
However
Advances in clock speeds, and in bigger and faster memory and storage, have lagged compared to, e.g., 10 years ago. One can no longer "just wait a year" and get a better computer.
So modern HPC means more hardware, not faster hardware.
Thus parallel programming/computing is required.
Analogy
HR Dilemma
Problem: job needs to get done faster
can't hire substantially faster people
can hire more people
Solution:
split work up between people (divide and conquer)
requires rethinking the work flow process
requires administration overhead
eventually administration larger than actual work
Wait, what about Moore’s Law?
(source: Transistor Count and Moore’s Law - 2008.svg, by Wgsimon, wikipedia)
Moore’s law
. . . describes a long-term trend in the history of computing hardware. The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years.
(source: Moore’s law, wikipedia)
But. . .
Moore's Law didn't promise us clock speed.
More transistors, but it's getting hard to push clock speed up.
Power density is the limiting factor.
So: more cores at fixed clock speed.
Parallel Computing
Thinking Parallel
The general idea is if one processor is good, many processors will be better
Parallel programming is not generally trivial
Tools for automated parallelism are either highly specialized or absent
serial algorithms/mathematics don't always work well in parallel without modification
Parallel Programming
it's Necessary (serial performance has peaked)
it's Everywhere (cellphones, tablets, laptops, etc.)
it's still increasing (Sequoia 1.5M cores, Tianhe-2 3.12M cores)
Parallel Computing
NVIDIA Serial vs Parallel Computing
https://www.youtube.com/watch?v=XcolCeWIcss
Concurrency
Must have something to do for all these cores.
Find parts of the program that can be done independently, and therefore concurrently.
There must be many such parts.
Their order of execution should not matter either.
Data dependencies limit concurrency.
(source: http://flickr.com/photos/splorp)
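The distinction can be made concrete with a small sketch (not from the slides; the function and values are illustrative): iterations without data dependencies can run in any order, hence concurrently, while a loop-carried dependency forces sequential execution.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return x * x

# Independent: each f(x) needs only its own input, so the order of
# execution does not matter and the map can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(f, range(8)))

# Dependent: each step needs the previous result, so this loop
# cannot be parallelized as written (a data dependency on acc).
acc = 0
for x in range(8):
    acc = acc + f(x)

print(squares, acc)
```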
Parameter study: best case scenario
Aim is to get results from a model as a parameter varies.
Can run the serial program on each processor at the same time.
Get "more" done.
[Diagram: four independent serial runs, µ = 1, 2, 3, 4, each producing its own Answer]
Throughput
How many tasks can you do per time unit?
throughput = H = N / T
Maximizing H means that you can do as much as possible.
Independent tasks: using P processors increases H by a factor P.
[Diagram: one processor vs. P processors working through N tasks]
Serial: T = N T1, H = 1/T1.  Parallel: T = N T1 / P, H = P/T1.
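The throughput formulas above can be checked with a minimal sketch; T1, N, and P are the symbols from the slide, and the numbers are made up for illustration.

```python
T1 = 2.0   # time for one task (seconds)
N = 100    # number of independent tasks
P = 4      # number of processors

T_serial = N * T1          # one processor: T = N*T1
T_parallel = N * T1 / P    # P processors:  T = N*T1/P

H_serial = 1.0 / T1        # throughput H = 1/T1
H_parallel = P / T1        # throughput H = P/T1

# Independent tasks: P processors increase throughput by a factor P.
assert H_parallel / H_serial == P
```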
Scaling — Throughput
How a problem's throughput scales as processor number increases ("strong scaling").
In this case, linear scaling:
H ∝ P
This is perfect scaling.
[Plot: tasks per unit time vs. P, linear from P = 1 to 8]
Scaling – Time
How a problem's timing scales as processor number increases.
Measured by the time to do one unit. In this case, inverse linear scaling:
T ∝ 1/P
Again this is the ideal case, or "embarrassingly parallel".
[Plot: time per unit task vs. P, falling as 1/P from P = 1 to 8]
Scaling – Speedup
How much faster the problem is solved as processor number increases.
Measured by the serial time divided by the parallel time:
S = T_serial / T(P) ∝ P
For embarrassingly parallel applications: linear speed-up.
[Plot: speed-up vs. P, linear from P = 1 to 8]
Non-ideal cases
Say we want to integrate some tabulated experimental data.
Integration can be split up, so different regions are summed by each processor.
Non-ideal:
First need to get the data to each processor.
And at the end, bring together all the sums: "reduction".
[Diagram: Partition data → regions 1-4 summed in parallel → Reduction → Answer]
Non-ideal cases
Parallel region ⇒ perfectly parallel (for large N)
Serial portion and parallel overhead ⇒ the partition and reduction steps
[Diagram: Partition data → regions 1-4 summed in parallel → Reduction → Answer]
Suppose the non-parallel part is constant: Ts
Amdahl’s law
Speed-up (without parallel overhead):
S = (N T1 + Ts) / (N T1 / P + Ts)
or, calling f = Ts / (Ts + N T1) the serial fraction,
S = 1 / (f + (1 − f)/P)  →  1/f  as P → ∞
[Plot: speed-up vs. P for f = 5%, flattening toward the 1/f = 20 ceiling]
Serial part dominates asymptotically.
Speed-up is limited, no matter the size of P.
And this is the overly optimistic case!
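A quick sketch of Amdahl's law as stated above, with f the serial fraction; the f = 5% value matches the plot, and the processor counts are illustrative.

```python
def amdahl_speedup(f, P):
    """Amdahl's law: S = 1 / (f + (1 - f)/P)."""
    return 1.0 / (f + (1.0 - f) / P)

f = 0.05  # 5% serial fraction, as in the plot
for P in (2, 16, 1024):
    print(P, amdahl_speedup(f, P))

# As P grows, S approaches the 1/f = 20x ceiling:
# even with 1024 processors the speed-up is only about 19.6x.
```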
HPC Lesson #1
Always keep throughput in mind: if you have several runs, running more of them at the same time on fewer processors per run is often advantageous.
Trying to beat Amdahl’s law #1
Scale up!
The larger N, the smaller the serial fraction:
f(P) = P / N
[Plot: speed-up vs. P for N = 100, 1,000, 10,000, 100,000, approaching the ideal line as N grows]
Weak scaling: Increase problem size while increasing P
Time_weak(P) = Time(N = n × P, P)
Good weak scaling means this time approaches a constant for large P.
Gustafson's Law
Any large enough problem can be efficiently parallelized (Efficiency → 1).
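Weak scaling can be sketched numerically (not from the slides; T1, Ts, and n are illustrative values, with T1 the per-task time and Ts the constant serial part as in Amdahl's law above): grow the problem with P and the time per run stays flat, while the serial fraction shrinks.

```python
T1, Ts, n = 1.0, 5.0, 1000   # illustrative: per-task time, serial time, tasks per processor

def time_weak(P):
    """Time_weak(P) = Time(N = n*P, P) = n*T1 + Ts, independent of P."""
    N = n * P                 # problem size grows with P
    return N * T1 / P + Ts

def serial_fraction(P):
    """f = Ts / (Ts + N*T1) shrinks as the problem grows with P."""
    N = n * P
    return Ts / (Ts + N * T1)

# Flat time for large P: good weak scaling.
assert time_weak(1) == time_weak(1024)
# Gustafson: the serial fraction vanishes, so efficiency -> 1.
assert serial_fraction(1024) < serial_fraction(1)
```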
Trying to beat Amdahl’s law #2
Parallel region ⇒ perfectly parallel (for large N)
Serial portion ⇒ partitioning the data
Parallel overhead ⇒ the reduction: rewrite it as a pairwise tree, so its cost is ∝ log2(P)
[Diagram: Partition data → regions 1-4 summed in parallel → pairwise tree Reduction → Answer]
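The log2(P) idea can be sketched as a serial simulation of a pairwise (tree) reduction (a hypothetical helper, not from the slides): partial sums are combined in pairs, so the number of parallel steps is ceil(log2(P)) rather than P − 1.

```python
import math

def tree_reduce(values):
    """Combine values pairwise; returns (total, number of parallel steps)."""
    steps = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            pairs.append(values[-1])   # odd element rides along to the next level
        values = pairs
        steps += 1
    return values[0], steps

# 8 partial sums: 3 combining steps instead of 7 serial additions.
total, steps = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
assert total == 36 and steps == math.ceil(math.log2(8))
```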
HPC Lesson #2
The optimal serial algorithm for your problem may not be the P→1 limit of your optimal parallel algorithm.
Synchronization
Most problems are not purely concurrent.
Some level of synchronization or exchange of information is needed between tasks.
While synchronizing, nothing else happens: this increases Amdahl's f.
And synchronizations are themselves costly.
Load Balancing
The division of calculations among the processors may not be equal.
Some processors would already be done, while others are still going.
Effectively, fewer than P processors are then being used, which reduces the efficiency.
Aim for load-balanced algorithms.
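One common remedy is dynamic load balancing: rather than statically handing each processor an equal number of tasks, workers pull tasks from a shared queue as they finish, so nobody sits idle while work remains. A sketch (the task costs are invented; with CPython's GIL this illustrates the scheduling, not an actual speedup):

```python
import queue
import threading

# Tasks of very unequal cost (cost = number of loop iterations);
# these sizes are invented for the example.
tasks = [200_000, 100, 100, 100, 200_000, 100, 100, 100]

def run(n):
    s = 0
    for i in range(n):
        s += i
    return s

def dynamic(task_list, P=4):
    # Dynamic load balancing: P workers pull from a shared queue, so a
    # worker that finishes a cheap task immediately grabs more work.
    q = queue.Queue()
    for i, t in enumerate(task_list):
        q.put((i, t))
    results = [None] * len(task_list)

    def worker():
        while True:
            try:
                i, t = q.get_nowait()
            except queue.Empty:
                return               # queue drained: this worker is done
            results[i] = run(t)

    workers = [threading.Thread(target=worker) for _ in range(P)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

results = dynamic(tasks)
```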
Load Balancing
Locality
So far we neglected communication costs.
But communication costs are more expensive than computation!
To minimize the communication-to-computation ratio:
* Keep the data where it is needed.
* Make sure as little data as possible is to be communicated.
* Make shared data as local to the right processors as possible.
Local data means less need for syncs, or smaller-scale syncs.
Local syncs can alleviate load balancing issues.
Locality
Domain Decomposition
(Image sources:
http://www.uea.ac.uk/cmp/research/cmpbio/Protein+Dynamics,+Structure+and+Function
http://sivo.gsfc.nasa.gov/cubedsphere_comp.html
http://adg.stanford.edu/aa241/design/compaero.html
http://www.cita.utoronto.ca/~dubinski/treecode/node8.html)

• A very common approach to parallelizing on distributed memory computers
• Maintains locality: mostly local data is needed, so only surface data has to be sent between processes.
Domain Decomposition
Guardcells
• Works for parallel decomposition!
• Job 1 needs info on Job 2's 0th zone; Job 2 needs info on Job 1's last zone.
• Pad each array with 'guardcells' and fill them with the info from the appropriate node by message passing or shared memory.

[Diagram: the Global Domain is split so that Job 1 ends with zones n-4 n-3 n-2 n-1 n and Job 2 begins with zones -1 0 1 2 3; the overlapping edge zones are the guardcells each job fills from its neighbour.]
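The guardcell exchange can be sketched for a 1D domain split between two "jobs". Here the message passing is simulated by plain copies, and the 3-point averaging stencil is an invented example; the point is that after the exchange, each job can update its interior independently and together they reproduce the serial answer:

```python
# Global 1D domain, updated serially with a 3-point averaging stencil
# as the reference answer (the stencil itself is an invented example).
global_dom = [float(i * i) for i in range(10)]

def stencil(u):
    # Update interior zones only; the two end values are held fixed.
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2
                     for i in range(1, len(u) - 1)] + [u[-1]]

ref = stencil(global_dom)

# Split the domain between Job 1 (left half) and Job 2 (right half),
# each padded with one guardcell on the shared edge.
mid = len(global_dom) // 2
job1 = global_dom[:mid] + [0.0]    # guardcell on the right
job2 = [0.0] + global_dom[mid:]    # guardcell on the left

# "Message passing": fill each guardcell from the neighbour's edge zone.
job1[-1] = job2[1]     # Job 2's 0th real zone
job2[0] = job1[-2]     # Job 1's last real zone

# Each job now updates its own interior independently; dropping the
# guardcells and concatenating reproduces the serial update exactly.
new1 = stencil(job1)[:-1]
new2 = stencil(job2)[1:]
```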
Locality
Example (PDE Domain decomposition)
[Figure: two domain decompositions of the same problem, labeled "wrong" and "right".]
HPC Lesson #3
Parallel algorithm design is about finding as much concurrency as possible, and arranging it in a way that maximizes locality.
Outline
1 SciNet
2 HPC Overview
3 Parallel Computing: Amdahl's law, Beating Amdahl's law, Load Balancing, Locality
4 HPC Hardware: Distributed Memory, Shared Memory, Hybrid Architectures, Heterogeneous Architectures, Software
5 HPC Programming Models & Software
6 Serial Jobs : GNU Parallel
7 Useful Sites
HPC Systems
Top500.org:
List of the world's 500 largest supercomputers. Updated every 6 months.
Info on architecture, etc.
HPC Systems
Architectures
Clusters, or, distributed memory machines
A bunch of servers linked together by a network (“interconnect”).
Examples: commodity x86 with GigE, Cray XK, IBM BGQ.

Symmetric Multiprocessor (SMP) machines, or, shared memory machines
These can all see the same memory; typically a limited number of cores.
Examples: IBM Pseries, Cray SMT, SGI Altix/UV.

Vector machines.
No longer dominant in HPC. Examples: Cray, NEC.

Accelerator (GPU, Cell, MIC, FPGA)
Heterogeneous use of standard CPUs with a specialized accelerator.
Examples: NVIDIA, AMD, Intel Phi, Xilinx, Altera.
Distributed Memory: Clusters
Simplest type of parallel computer to build:
Take existing powerful standalone computers
And network them
(source: http://flickr.com/photos/eurleif)
Distributed Memory: Clusters
Each node is independent!
Parallel code consists of programs running on separate computers, communicating with each other.
Could be entirely different programs.

Each node has its own memory!
Whenever it needs data from another region, it requests it from that CPU.

Usual model: “message passing”

[Diagram: four nodes (CPU1–CPU4), each with its own memory, exchanging messages over the network.]
Clusters+Message Passing
Hardware:
Easy to build. (Harder to build well.)
Can build larger and larger clusters relatively easily.

Software:
Every communication has to be hand-coded: hard to program.

[Diagram: four nodes (CPU1–CPU4) exchanging messages over the network.]
Task (function, control) Parallelism
Work to be done is decomposed across processors
e.g. divide and conquer
each processor responsible for some part of the algorithm
communication mechanism is significant
must be possible for different processors to be performingdifferent tasks
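A minimal sketch of task parallelism: two different tasks over the same data run concurrently on different workers. The sum-and-maximum pairing is an invented example standing in for, say, one step of a divide-and-conquer algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

# Two *different* tasks over the same data -- each worker is
# responsible for a different part of the algorithm.
# (The data and tasks are invented for illustration.)
data = list(range(1, 101))

with ThreadPoolExecutor(max_workers=2) as pool:
    total_future = pool.submit(sum, data)   # worker 1: summation task
    peak_future = pool.submit(max, data)    # worker 2: maximum task
    total = total_future.result()           # collecting the results is the
    peak = peak_future.result()             # communication mechanism here
```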
Cluster Communication Cost
             Latency             Bandwidth
GigE         10 µs (10,000 ns)   1 Gb/s    (~60 ns/double)
Infiniband    2 µs  (2,000 ns)   2-10 Gb/s (~10 ns/double)
Processor speed: O(GFLOP) ∼ few ns or less.
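The per-double figures follow from the bandwidth alone: a double is 64 bits, so at GigE's 1 Gb/s moving one double costs about 64 ns (the table rounds this to ~60 ns), compared with a few ns to compute with it. A quick check of the arithmetic:

```python
BITS_PER_DOUBLE = 64

def ns_per_double(bandwidth_gbit_per_s):
    # 1 Gb/s = 1 bit/ns, so bits divided by (Gb/s) comes out in ns.
    return BITS_PER_DOUBLE / bandwidth_gbit_per_s

gige = ns_per_double(1.0)    # ~64 ns per double over gigabit ethernet
ib = ns_per_double(10.0)     # ~6.4 ns per double over 10 Gb/s InfiniBand
```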
Cluster Communication Cost
SciNet General Purpose Cluster (GPC)
3864 nodes with two 2.53GHz quad-core Intel Xeon 5500 (Nehalem) x86-64 processors (30240 cores total)
16GB RAM per node
Gigabit ethernet network on all nodes for management and boot
DDR and QDR InfiniBand network on the nodes for job communication and file I/O
306 TFlops
debuted at #16 on the June 2009 TOP500 supercomputer list
currently (Jul. 2016) #552
Shared Memory
One large bank of memory, different computing cores acting on it. All 'see' the same data.
Any coordination done through memory.
Could use message passing, but no need.
Each core is assigned a thread of execution of a single program that acts on the data.

[Diagram: four cores (Core 1–Core 4) all reading from and writing to one shared Memory.]
Threads versus Processes
Threads: threads of execution within one process, with access to the same memory, etc.
Processes: independent tasks with their own memory and resources.
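The difference is easy to demonstrate: threads within one process act on the very same objects. A sketch (the counter is an invented example):

```python
import threading

counter = {"n": 0}           # a single object in this process's memory
lock = threading.Lock()

def bump():
    # Both threads act on the *same* dictionary -- shared memory.
    # The lock is coordination "done through memory".
    for _ in range(1000):
        with lock:
            counter["n"] += 1

threads = [threading.Thread(target=bump) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter["n"] is now 2000: both threads updated the same memory.
# Two separate *processes* would each have updated a private copy,
# and would have to exchange messages to combine them.
```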
Shared Memory: NUMA
Non-Uniform Memory Access
Each core typically has some memory of its own.
Cores have cache too.
Keeping this memory coherent is extremely challenging.

[Diagram: four cores, each with its own local memory, linked together.]
Coherency
The different levels of memory imply multiple copies of some regions.
Multiple cores mean these copies can be updated unpredictably.
Very expensive hardware.
Hard to scale up to lots of processors.
Very simple to program!!

[Diagram: one core writes x[10] = 5 into its copy of memory; another core then asks: x[10] = ?]
Data (Loop) Parallelism
Data is distributed across processors
easier to program; amenable to compiler optimization
code otherwise looks fairly sequential
benefits from minimal communication overhead
scale limitations
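In a data-parallel loop, the same operation is applied to different slices of the data while the code stays essentially sequential in appearance. A sketch (the function and data are invented; in HPC practice this is the pattern an OpenMP parallel-for or a vectorized loop expresses):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1000))

def f(x):
    return 3 * x + 1         # the same operation for every element

# Serial version -- the loop "looks fairly sequential".
serial = [f(x) for x in data]

# Data-parallel version: the elements are distributed across 4 workers,
# but the code barely changes.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(f, data))
```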
Shared Memory Communication Cost
                      Latency             Bandwidth
GigE                  10 µs (10,000 ns)   1 Gb/s     (~60 ns/double)
Infiniband             2 µs  (2,000 ns)   2-10 Gb/s  (~10 ns/double)
NUMA (shared memory)  0.1 µs    (100 ns)  10-20 Gb/s (~4 ns/double)
Processor speed: O(GFLOP) ∼ few ns or less.
SciNet Tightly Coupled System (TCS)
104 nodes with 16x 4.7GHz dual-core IBM Power6 processors (3328 cores total)
128GB RAM per node
4x DDR InfiniBand network on the nodes for job communication and file I/O
62 TFlops
Hybrid Architectures
Use shared and distributed memory together (i.e. OpenMP with MPI).
Need to exploit multi-level parallelism.

Homogeneous
Identical multicore machines linked together with an interconnect.
Many cores have modest vector capabilities.
Thread on-node, MPI for off-node.

Heterogeneous
Same as above, but with an accelerator as well. GPU, Xeon Phi, FPGA.
Heterogeneous Computing
What is it?
Use different compute device(s) concurrently in the same computation.
Commonly using a CPU with an accelerator: GPU, Xeon Phi, FPGA, ...
Example: leverage CPUs for general computing components and use GPUs for data parallel / FLOP intensive components.
Pros: Faster and cheaper ($/FLOP/Watt) computation
Cons: More complicated to program
Heterogeneous Computing
Terminology
GPGPU : General Purpose Graphics Processing Unit
HOST : CPU and its memory
DEVICE : Accelerator (GPU/Phi) and its memory
GPU vs. CPUs
CPU
general purpose
task parallelism (diverse tasks)
maximize serial performance
large cache
multi-threaded (4-16)
some SIMD (SSE, AVX)
GPU
data parallelism (single task)
maximize throughput
small cache
super-threaded (500-2000+)
almost all SIMD
GPGPU
http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html
Xeon Phi
What is it?
Intel x86 based Accelerator/Co-processor
Many Integrated Cores (MIC) Architecture
Large number of low-powered, but low cost (computational overhead, power, size, monetary cost) processors (Pentiums).
Xeon Phi 5110P (Knights Corner)
60 cores @ 1.053GHz
8 GB memory
4-way SMT (240 threads)
PCIe Gen2 bus connectivity
Runs linux onboard
NOTE: Next Gen Knights Landing: 72-core native processor
Heterogeneous Speedup
What kind of speedup can I expect?
∼1 TFLOPs per GPU vs. ∼100 GFLOPs multi-core CPU
0x - 50x reported
Speedup depends on
problem structure
need many identical independent calculations, preferably with sequential memory access
single vs. double precision (K20 3.52 TF SP vs 1.17 TF DP)
data locality
level of intimacy with hardware
programming time investment
Accelerator Programming
Languages
GPGPU Only
OpenGL, DirectX (Graphics only)
CUDA (NVIDIA proprietary)
OpenCL (1.0, 1.1, 2.0)
OpenACC
OpenMP 4.0
Compute Canada GPU Resources
Westgrid: Parallel
60 nodes (3x NVIDIA M2070)
SharcNet: Monk
54 nodes (2x NVIDIA M2070)
SciNet: Gravity, ArcX
49 nodes (2x NVIDIA M2090)
1 node (1x NVIDIA K20, 1x Xeon Phi 3120A)
Calcul Québec: Guillimin
50 nodes (2x NVIDIA K20)
Compute Canada Xeon Phi Resources
SciNet - ArcX
1 node (1 x 8-core Sandybridge Xeon, 32GB)
1 x Intel Xeon Phi 3120A (57 cores @ 1.1 GHz, 6GB)
qsub -l nodes=1:ppn=8,walltime=2:00:00 -q arcX -I
module load intel/14.0.1 intelmpi/4.1.2.040
Calcul Québec - Guillimin
50 nodes (2 x 8-core Intel Sandy Bridge Xeon, 64GB)
2 x Intel Xeon Phi 5110P (60 cores @ 1.053GHz, 8GB)
HPC Lesson #4
The best approach to parallelizing your problem will depend on the details of both your problem and the hardware available.
Outline
1 SciNet
2 HPC Overview
3 Parallel Computing: Amdahl's law, Beating Amdahl's law, Load Balancing, Locality
4 HPC Hardware: Distributed Memory, Shared Memory, Hybrid Architectures, Heterogeneous Architectures, Software
5 HPC Programming Models & Software
6 Serial Jobs : GNU Parallel
7 Useful Sites
Program Structure
The structure of the problem dictates the ease with which we can implement parallel solutions.
Parallel Granularity
Granularity
A measure of the amount of processing performed beforecommunication between processes is required.
Parallelism
Fine Grained
constant communication necessary; best suited to shared memory environments
Coarse Grained
significant computation performed before communication is necessary; ideally suited to message-passing environments
Perfect
no communication necessary
HPC Programming Models
Languages
serial
C, C++, Fortran
threaded (shared memory)
OpenMP, pthreads
message passing (distributed memory)
MPI, PGAS (UPC, Coarray Fortran)
accelerator (GPU, Cell, MIC, FPGA)
CUDA, OpenCL, OpenACC
HPC Systems
HPC Software Stack
Typically GNU/Linux
non-interactive batch processing using a queuing system scheduler
software packages and versions usually available as “modules”
Parallel filesystem (GPFS, Lustre)
Outline
1 SciNet
2 HPC Overview
3 Parallel Computing: Amdahl's law, Beating Amdahl's law, Load Balancing, Locality
4 HPC Hardware: Distributed Memory, Shared Memory, Hybrid Architectures, Heterogeneous Architectures, Software
5 HPC Programming Models & Software
6 Serial Jobs : GNU Parallel
7 Useful Sites
Serial Jobs
SciNet is primarily a parallel computing resource. (Parallel here means OpenMP and MPI, not many serial jobs.)
You should never submit purely serial jobs to the GPC queue. The scheduling queue gives you a full 8-core node; per-node scheduling of serial jobs would mean wasting 7 cpus.
Nonetheless, if you can make efficient use of the resources using serial runs and get good science done, that's good too.
Users need to utilize whole nodes by running at least 8 serial runs at once.
Easy case: serial runs of equal duration
#PBS -l nodes=1:ppn=8,walltime=1:00:00
cd $PBS_O_WORKDIR
(cd rundir1; ./dorun1) &
(cd rundir2; ./dorun2) &
(cd rundir3; ./dorun3) &
(cd rundir4; ./dorun4) &
(cd rundir5; ./dorun5) &
(cd rundir6; ./dorun6) &
(cd rundir7; ./dorun7) &
(cd rundir8; ./dorun8) &
wait # or all runs get killed immediately
Hard case: serial runs of unequal duration
Different runs may not take the same time: load imbalance.
Want to keep all 8 cores on a node busy.
Or even 16 virtual cores on a node (HyperThreading).
⇒ GNU Parallel can do this
GNU Parallel
GNU parallel is a tool to run multiple (serial) jobs in parallel. As parallel is used within a GPC job, we'll call these subjobs.
It allows you to keep the processors on each 8-core node busy, if you provide enough subjobs.
GNU Parallel can use multiple nodes as well.
On the GPC cluster:
GNU parallel is accessible on the GPC in the module gnu-parallel, which you can load in your .bashrc.
$ module load gnu-parallel
GNU Parallel Example
SETUP
A serial C++ code 'mycode.cc' needs to be compiled.
It needs to be run 32 times with different parameters, 1 through 32.
The parameters are given as a command line argument.
8 subjobs of this code fit into the GPC compute nodes' memory.
Each serial run takes ∼2 hours on average.
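A quick sanity check on those numbers (all from the slide above): with 8 subjobs running at once, 32 × 2 h of work takes at least 8 h even with perfect balance, so a 12 h walltime request leaves slack for load imbalance.

```shell
# Back-of-the-envelope walltime estimate:
# 32 subjobs x ~2 hours each, 8 running at a time.
ideal_hours=$(( 32 * 2 / 8 ))
echo "$ideal_hours"   # prints 8: ideal hours with perfect balance
```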
GNU Parallel Example

$ cd $SCRATCH/example
$ module load intel
$ icpc -O3 -xhost mycode.cc -o myapp
$ cat > subjob.lst
mkdir run01; cd run01; ../myapp 1 > out
mkdir run02; cd run02; ../myapp 2 > out
...
mkdir run32; cd run32; ../myapp 32 > out
$ cat > GPJob
#PBS -l nodes=1:ppn=8,walltime=12:00:00
cd $SCRATCH/example
module load intel gnu-parallel
parallel --jobs 8 < subjob.lst
$ qsub GPJob
2961985.gpc-sched
$ ls
GPJob  GPJob.e2961985  GPJob.o2961985  subjob.lst
myapp  run01  run02  run03 ...
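Typing 32 nearly identical lines into subjob.lst by hand is error-prone; a shell loop can generate the same list. This is a sketch assuming the directory layout from the example above (run01 ... run32, one integer argument per run).

```shell
# Generate the 32 subjob command lines of the example.
# The run directory is zero-padded to two digits (run01 ... run32).
for i in $(seq 1 32); do
    printf 'mkdir run%02d; cd run%02d; ../myapp %d > out\n' "$i" "$i" "$i"
done > subjob.lst

wc -l subjob.lst    # expect 32 lines
```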
GNU Parallel Example

[Figure: core utilization over time. 17 hours, 42% utilization ⇒ 10 hours, 72% utilization.]
GNU Parallel Details
What else can it do?
Recover from crashes (joblog/resume options)
Span multiple nodes
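The crash-recovery options can be sketched as follows (a hedged example: --joblog and --resume are GNU parallel's flags, the file names are made up). If the scheduler kills the job at the walltime limit, resubmitting the same command skips subjobs the log already records as finished.

```shell
# --joblog records each subjob's start time, runtime and exit code;
# --resume consults that log and re-runs only unfinished entries.
printf 'echo one\necho two\n' > demo2.lst
if command -v parallel >/dev/null 2>&1; then
    parallel --jobs 8 --joblog subjobs.log --resume < demo2.lst
fi
```

To span multiple nodes, one would additionally hand parallel a list of hostnames (e.g. via --sshloginfile); see the links below for the SciNet-specific recipe.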
Using GNU Parallel
wiki.scinethpc.ca/wiki/index.php/User_Serial
wiki.scinethpc.ca/wiki/images/7/7b/Tech-talk-gnu-parallel.pdf
www.gnu.org/software/parallel
www.youtube.com/playlist?list=PL284C9FF2488BC6D1
O. Tange, "GNU Parallel – The Command-Line Power Tool", ;login: The USENIX Magazine, February 2011, pp. 42-47.
Outline
1 SciNet
2 HPC Overview
3 Parallel Computing
  Amdahl's law
  Beating Amdahl's law
  Load Balancing
  Locality
4 HPC Hardware
  Distributed Memory
  Shared Memory
  Hybrid Architectures
  Heterogeneous Architectures
  Software
5 HPC Programming Models & Software
6 Serial Jobs : GNU Parallel
7 Useful Sites
www.scinethpc.ca
wiki.scinethpc.ca
https://support.scinet.utoronto.ca/education
https://portal.scinet.utoronto.ca
SciNet usage reports
Change password, default allocation, maillist subscriptions