Advanced Topics in Numerical Analysis: High Performance Computing
MATH-GA 2012.001 & CSCI-GA 2945.001
Georg Stadler, Courant Institute, [email protected]
Spring 2017, Thursday, 5:10–7:00PM, WWH #512
Feb. 24, 2017
1 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
2 / 48
Organization issues
- Please register for an XSEDE account so that Bill can add you to the Stampede allocation (see his post on Piazza). You are welcome to do that even if you only audit the course.
- Please fill out the poll for the makeup class on either March 6 or 7.
- Next week, my colleague Marsha Berger will fill in for me.
- And yes, there will be new homework at some point. . .
3 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
4 / 48
Memory hierarchies
On my MacBook Pro: 32KB L1 cache, 256KB L2 cache, 3MB L3 cache, 8GB RAM
CPU: O(1 ns), L2/L3: O(10 ns), RAM: O(100 ns), disk: O(10 ms)
5 / 48
Memory hierarchies
Important terms:
- latency: time it takes to load/write data from/to a specific location in RAM to/from the CPU registers (in seconds)
- bandwidth: rate at which data can be read/written (for large data); in bytes/second
- cache hit: required data is available in cache ⇒ fast access
- cache miss: required data is not in cache and must be loaded from main memory (RAM) ⇒ slow access

Computer architecture is complicated. We need a basic performance model.
- The processor needs to be “fed” with data to work on.
- Memory access is slow; memory hierarchies help.
6 / 48
Memory hierarchy
Simple model

1. Only consider two levels in the hierarchy: fast (cache) and slow (RAM) memory
2. All data is initially in slow memory
3. Simplifications:
   - ignore that memory access and arithmetic operations can happen at the same time
   - assume the time for access to fast memory is 0
4. Computational intensity: flops per slow memory access

   q = f/m, where f = #flops and m = #slow memory accesses

The computational intensity should be as large as possible.
7 / 48
Memory hierarchy
Example: Matrix-matrix multiply
Comparison between naive and blocked (optimized) matrix-matrix multiplication for different matrix sizes: different algorithms can increase the computational intensity significantly.
BLAS: optimized Basic Linear Algebra Subprograms
- Temporal and spatial locality is key for fast performance.
- Since arithmetic is cheap compared to memory access, one can consider performing extra flops if doing so reduces memory accesses.
- In distributed-memory parallel computations, the memory hierarchy extends to data stored on other processors, which is only available through communication over the network.
8 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
9 / 48
Levels of parallelism
- Parallelism at the bit level (64-bit operations)
- Parallelism by pipelining (overlapping the execution of multiple instructions); “assembly line” parallelism, instruction-level parallelism (ILP); several operations per cycle
- Multiple-functional-unit parallelism: ALUs (arithmetic logic units), FPUs (floating point units), load/store memory units, . . .

All of the above assume a single sequential control flow.

- Process/thread-level parallelism: independent processor cores, multicore processors; parallel control flow
10 / 48
Amdahl’s law
Is there enough parallelism in my problem?

Suppose only part of the application is parallel. Amdahl’s law:
- Let s be the fraction of work done sequentially, and (1 − s) the fraction done in parallel.
- Let p be the number of parallel processors (cores).

Speedup:

  time(1 proc) / time(p procs) ≤ 1 / (s + (1 − s)/p) ≤ 1/s

Thus: performance is limited by the sequential part.
Source: Wikipedia
11 / 48
Load (im)balance in parallel computations
In parallel computations, the work should be distributed evenly across workers/processors.
- Load imbalance: idle time due to insufficient parallelism or unequally sized tasks
- Initial/static load balancing: distribution of work at the beginning of the computation
- Dynamic load balancing: the work load needs to be re-balanced during the computation. Imbalance can occur, e.g., due to
  - adaptivity (mesh refinement)
  - completely unstructured problems
12 / 48
Parallel scalability
Strong and weak scaling/speedup

Strong scalability:
[Figure: fixed total work; plot of time = 1/efficiency vs. number of processors (1, 2, 4, 8, 16, 32), in %]

Weak scalability:
[Figure: work grows with processors; plot of parallel efficiency vs. number of processors (1, 2, 4, 8, 16, 32), in %]

13 / 48
Locality and parallelism
Locality of data to the processing unit is critical in general, and even more so on parallel computers.
- Memory hierarchies are even deeper on parallel computers.
- In parallel computing, data often must be communicated through the network, which is slow. Again, locality is key.
- Computation should mostly be on local (cache-local, in DRAM, processor-local, disk-local) data.
14 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
15 / 48
Why Use Version Control?
Slides adapted from Andreas Skielboe
A Version Control System (VCS) is an integrated, fool-proof framework for
- Backup and restore
- Short- and long-term undo
- Tracking changes
- Synchronization
- Collaborating
- Sandboxing
... with minimal overhead.
16 / 48
Local Version Control Systems
Conventional version control systems provide some of these features by keeping a local database of all changes made to files.
Any file can be recreated by retrieving changes from the database and patching them together.
17 / 48
Centralized Version Control Systems
To enable synchronization and collaboration, the database is stored on a central VCS server, where everyone works in the same database.
This introduces problems: a single point of failure and the inability to work offline.
18 / 48
Distributed Version Control Systems
To overcome the problems of centralization, distributed VCSs (DVCSs) were invented. They keep a complete copy of the database in every working directory.
This is actually the simplest and most powerful implementation of any VCS.
19 / 48
Git Basics - The Git Workflow
The simplest use of Git:
- Modify files in your working directory.
- Stage the files, adding snapshots of them to your staging area.
- Commit: take the files in the staging area and store that snapshot permanently in your Git directory.
20 / 48
Git Basics - The Three States
The three basic states of files in your Git repository:
21 / 48
Git Basics - Commits
Each commit in the Git directory holds a snapshot of the files that were staged and thus went into that commit, along with author information.
Each and every commit can always be inspected and retrieved.
22 / 48
Git Basics - File Status Lifecycle
Files in your working directory can be in four different states in relation to the current commit.
23 / 48
Git Basics - Working with remotes
In Git, all remotes are equal.
A remote in Git is nothing more than a link to another Git directory.
24 / 48
Git Basics - Working with remotes
The easiest commands to get started working with a remote are
- clone: cloning a remote makes a complete local copy.
- pull: gets changes from a remote.
- push: sends changes to a remote.

Fear not! We are only just getting into more advanced topics. So let's look at some examples.
25 / 48
Git Basics - Advantages
Basic advantages of using Git:
- Nearly every operation is local.
- Committed snapshots are always kept.
- Strong support for non-linear development.
26 / 48
Hands-on - First-Time Git Setup
Before using Git for the first time:
Pick your identity
$ git config --global user.name "John Doe"
$ git config --global user.email [email protected]
Check your settings
$ git config --list
Get help
$ git help <verb>
27 / 48
Hands-on - Getting started with a bare remote server
Using a Git server (i.e., a bare repository with no working directory) is the analogue in Git of a regular centralized VCS.
28 / 48
Hands-on - Getting started with remote server
When the remote server is set up with an initialized Git directory, you can simply clone the repository:
Cloning a remote repository
$ git clone <repository>
You will then get a complete local copy of that repository, which you can edit.
29 / 48
Hands-on - Getting started with remote server
With your local working copy you can make any changes to the files in your working directory as you like. When satisfied with your changes, you add any modified or new files to the staging area using add:
Adding files to the staging area
$ git add <filepattern>
30 / 48
Hands-on - Getting started with remote server
Finally, to commit the files in the staging area, you run commit, supplying a commit message.
Committing staging area to the repository
$ git commit -m <msg>
Note that so far everything is happening locally in your working directory.
31 / 48
Hands-on - Getting started with remote server
To share your commits with the remote, you invoke the push command:
Pushing local commits to the remote
$ git push
To receive changes that other people have pushed to the remote server, you can use the pull command:
Pulling remote commits to the local working directory
$ git pull
And that's it.
32 / 48
Hands-on - Summary
Summary of a minimal Git workflow:
- clone the remote repository
- add your changes to the staging area
- commit those changes to the Git directory
- push your changes to the remote repository
- pull remote changes to your local working directory
33 / 48
More advanced topics
Git is a powerful and flexible DVCS. Some very useful, but a bit more advanced, features include:
- Branching
- Merging
- Tagging
- Rebasing
34 / 48
References
Some good sources of Git information:
- Git Community Book - http://book.git-scm.com/
- Pro Git - http://progit.org/
- Git Reference - http://gitref.org/
- GitHub - http://github.com/
- Git from the bottom up - http://ftp.newartisans.com/pub/git.from.bottom.up.pdf
- Understanding Git Conceptually - http://www.eecs.harvard.edu/~cduan/technical/git/
- Git Immersion - http://gitimmersion.com/
35 / 48
Applications
GUIs for Git:
- GitX (MacOS) - http://gitx.frim.nl/
- Giggle (Linux) - http://live.gnome.org/giggle
36 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
37 / 48
Parallel architectures (Flynn’s taxonomy)
Characterization of architectures according to Flynn:
SISD: Single instruction, single data. This is the conventional sequential model.
SIMD: Single instruction, multiple data. Multiple processing units with identical instructions, each working on different data. Useful when many completely identical tasks are needed.
MIMD: Multiple instructions, multiple data. Multiple processing units with separate (but often similar) instructions and data/memory access (shared or distributed). We will mainly use this approach.
MISD: Multiple instructions, single data. Not practical.
38 / 48
Programming model must reflect architecture
Example: inner product between two (very long) vectors, a^T b:
- Where are a and b stored? In a single memory or distributed?
- What work should be done by which processor?
- How do the processors coordinate their result?
39 / 48
Shared memory programming model
- The program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: more than one processor core reads/writes to a memory location: a race condition.

The programming model must manage the different threads and avoid race conditions.
OpenMP: Open Multi-Processing is the application programming interface (API) that supports shared-memory parallelism: www.openmp.org
40 / 48
Distributed memory programming model
- The program is run by a collection of named processes, fixed at start-up.
- Local address space; no shared data.
- Logically shared data is distributed (e.g., every processor has direct access only to a chunk of rows of a matrix).
- Explicit communication through send/receive pairs.

The programming model must accommodate communication.
MPI: Message Passing Interface (different implementations: LAM, Open-MPI, MPICH, MVAPICH), http://www.mpi-forum.org/
41 / 48
Hybrid distributed/shared programming model
- A pure MPI approach splits the memory of a multicore processor into independent memory pieces and uses MPI to exchange information between them.
- A hybrid approach uses MPI across processors, and OpenMP for processor cores that have access to the same memory.
- A similar hybrid approach is also used for hybrid architectures, i.e., computers that contain CPUs and accelerators (GPGPUs, MICs).
42 / 48
Other parallel programming approaches
- Grid computing: loosely coupled problems; the most famous example was SETI@home.
- MapReduce: introduced by Google; targets large data sets with parallel, distributed algorithms on a cluster.
- WebCL
- Pthreads
- CUDA
- Cilk
- ...
43 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
44 / 48
Shared memory programming model
- The program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: more than one processor core reads/writes to a memory location: a race condition.

Only one process is running; it can fork into shared-memory threads.
45 / 48
Threads versus processes
- A process is an independent execution unit that contains its own state information (pointers to instructions and stack). One process can contain several threads.
- Threads within a process share the same address space and communicate directly using shared variables. Each thread has a separate stack, but the heap memory is shared.
- Stack memory: used for temporarily storing data; fast; last-in-first-out principle. Examples: int a = 2; double b = 2.11; no deallocation necessary; small size; static.
- Heap memory: not managed automatically; manually allocate/de-allocate/re-allocate; slower; larger.
- Using several threads can also be useful on a single processor (“multithreading”), depending on the memory latency.
46 / 48
Shared memory programming
- POSIX Threads (Pthreads) library; more intrusive than OpenMP.
- PGAS languages: partitioned global address space; logically partitioned, but can be programmed like a global memory address space (communication is taken care of in the background).
- OpenMP: Open Multi-Processing is a light-weight application programming interface (API) that supports shared-memory parallelism.
47 / 48
Shared memory programming—Literature
OpenMP standard online: www.openmp.org
Very useful online course: www.youtube.com/user/OpenMPARB/
Recommended reading: Chapter 6 and Chapter 7 of the textbooks shown on the slide.
48 / 48