
Advanced Topics in Numerical Analysis: High Performance Computing

MATH-GA 2012.001 & CSCI-GA 2945.001

Georg Stadler
Courant Institute, NYU
[email protected]

Spring 2017, Thursday, 5:10–7:00PM, WWH #512

Feb. 24, 2017

Outline

- Organization issues
- Last class summary
- Why writing fast code isn’t easy
- Git—repository systems
- Parallel machines and programming models
- Shared memory parallelism–OpenMP

Organization issues

- Please register for an XSEDE account so that Bill can add you to the Stampede allocation (see his post on Piazza). You’re welcome to do that even if you only audit the course.
- Please fill out the poll for the makeup class, either on March 6 or 7.
- Next week, my colleague Marsha Berger will fill in for me.
- And yes, there will be new homework at some point...

Last class summary

Memory hierarchies

On my MacBook Pro: 32KB L1 cache, 256KB L2 cache, 3MB L3 cache, 8GB RAM.

Approximate access times: CPU: O(1 ns), L2/L3: O(10 ns), RAM: O(100 ns), disk: O(10 ms).

Memory hierarchies

Important terms:

- latency: the time it takes to load/write data from/at a specific location in RAM to/from the CPU registers (in seconds)
- bandwidth: the rate at which data can be read/written (for large data), in bytes/second
- cache hit: required data is available in cache ⇒ fast access
- cache miss: required data is not in cache and must be loaded from main memory (RAM) ⇒ slow access

Computer architecture is complicated. We need a basic performance model:

- The processor needs to be “fed” with data to work on.
- Memory access is slow; memory hierarchies help.


Memory hierarchy
Simple model

1. Only consider two levels in the hierarchy: fast memory (cache) and slow memory (RAM).
2. All data is initially in slow memory.
3. Simplifications:
   - Ignore that memory access and arithmetic operations can happen at the same time.
   - Assume the time for access to fast memory is 0.
4. Computational intensity: flops per slow memory access,

   q = f/m, where f is the number of flops and m the number of slow memory operations.

Computational intensity should be as large as possible.
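As a worked example (not from the slides): adding two vectors, z = x + y with n entries each, performs f = n flops but needs m ≈ 3n slow memory operations (read x, read y, write z), so q ≈ 1/3. By contrast, a naive multiply of two n×n matrices that do not fit in cache performs f = 2n³ flops over m = O(n³) slow memory operations, so q = O(1), while a blocked version with block size b can reduce the traffic to m = O(n³/b), i.e., q = O(b).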

Memory hierarchy

Example: matrix-matrix multiply. Comparing naive and blocked (optimized) matrix-matrix multiplication for different matrix sizes shows that different algorithms can increase the computational intensity significantly.
BLAS: optimized Basic Linear Algebra Subprograms.

- Temporal and spatial locality is key for fast performance.
- Since arithmetic is cheap compared to memory access, one can consider doing extra flops if that reduces memory accesses.
- In distributed-memory parallel computations, the memory hierarchy is extended to data stored on other processors, which is only available through communication over the network.

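A minimal C sketch of the comparison (not from the slides; the matrix size N and block size B are illustrative, with N divisible by B and B chosen so that three B×B tiles fit in cache):

#include <stdio.h>
#include <stdlib.h>

#define N 512
#define B 64   /* block size; three BxB tiles should fit in cache */

/* Naive triple loop, C = C + A*B. Walks through b with stride N,
   so for large N nearly every access to b misses the cache. */
void matmul_naive(const double *a, const double *b, double *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = c[i*N + j];
            for (int k = 0; k < N; k++)
                sum += a[i*N + k] * b[k*N + j];
            c[i*N + j] = sum;
        }
}

/* Blocked version: works on BxB tiles that stay in cache, so each
   element loaded from slow memory is reused about B times. */
void matmul_blocked(const double *a, const double *b, double *c) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = c[i*N + j];
                        for (int k = kk; k < kk + B; k++)
                            sum += a[i*N + k] * b[k*N + j];
                        c[i*N + j] = sum;
                    }
}

int main(void) {
    double *a = malloc(N*N*sizeof(double));
    double *b = malloc(N*N*sizeof(double));
    double *c = calloc(N*N, sizeof(double));
    for (int i = 0; i < N*N; i++) { a[i] = 1.0; b[i] = 2.0; }
    matmul_blocked(a, b, c);   /* time this against matmul_naive */
    printf("c[0] = %f\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}

Timing the two functions for growing N shows the effect the slide describes: the flop count is identical, only the memory traffic differs.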

Why writing fast code isn’t easy

Levels of parallelism

- Parallelism at the bit level (64-bit operations)
- Parallelism by pipelining (overlapping the execution of multiple instructions); “assembly line” parallelism, instruction-level parallelism (ILP); several operations per cycle
- Multiple-functional-unit parallelism: ALUs (arithmetic logic units), FPUs (floating point units), load/store memory units, ...

All of the above assume a single sequential control flow.

- Process/thread-level parallelism: independent processor cores, multicore processors; parallel control flow


Amdahl’s law
Is there enough parallelism in my problem?

Suppose only part of the application is parallel. Amdahl’s law:

- Let s be the fraction of work done sequentially, and (1 − s) the part that is done in parallel.
- Let p be the number of parallel processors (cores).

Speedup:

  time(1 proc) / time(p proc) ≤ 1 / (s + (1 − s)/p) ≤ 1/s

Thus: performance is limited by the sequential part. For example, if 10% of the work is sequential (s = 0.1), the speedup can never exceed 10, no matter how many processors are used.

[Figure: Amdahl’s law speedup curves. Source: Wikipedia]
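A quick sketch (not from the slides) that tabulates the bound for a few sequential fractions:

#include <stdio.h>

/* Upper bound on speedup from Amdahl's law:
   s = sequential fraction, p = number of processors. */
double amdahl(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    double fractions[] = {0.5, 0.1, 0.05, 0.01};
    printf("%6s %8s %8s %8s\n", "s", "p=16", "p=1024", "p->inf");
    for (int i = 0; i < 4; i++) {
        double s = fractions[i];
        printf("%6.2f %8.2f %8.2f %8.2f\n",
               s, amdahl(s, 16), amdahl(s, 1024), 1.0 / s);
    }
    return 0;
}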

Load (im)balance in parallel computations

In parallel computations, the work should be distributed evenly across workers/processors.

- Load imbalance: idle time due to insufficient parallelism or unequally sized tasks
- Initial/static load balancing: distribution of work at the beginning of the computation
- Dynamic load balancing: the work load needs to be re-balanced during the computation. Imbalance can occur, e.g.,
  - due to adaptivity (mesh refinement)
  - in completely unstructured problems

Parallel scalability
Strong and weak scaling/speedup

Strong scalability: the total work is fixed while the number of processors grows, so ideally the compute time shrinks proportionally.

[Plot: time = 1/efficiency (0%–100%) vs. no of procs (1, 2, 4, 8, 16, 32)]

Weak scalability: the work per processor is fixed, so the total work grows with the number of processors and ideally the efficiency stays at 100%.

[Plot: efficiency (0%–100%) vs. no of procs (1, 2, 4, 8, 16, 32)]

Locality and parallelism

Locality of data to the processing unit is critical in general, and even more so on parallel computers.

- Memory hierarchies are even deeper on parallel computers.
- In parallel computing, data often must be communicated through the network, which is slow. Again, locality is key.
- Computation should mostly be on local (cache-local, in DRAM, processor-local, disk-local) data.

Git—repository systems

Why Use Version Control?
Slides adapted from Andreas Skielboe

A Version Control System (VCS) is an integrated, fool-proof framework for

- Backup and restore
- Short- and long-term undo
- Tracking changes
- Synchronization
- Collaborating
- Sandboxing

... with minimal overhead.

Local Version Control Systems

Conventional version control systems provide some of these features by keeping a local database of all changes made to files.

Any file can be recreated by getting the changes from the database and patching them together.

Centralized Version Control Systems

To enable synchronization and collaborative features, the database is stored on a central VCS server, where everyone works in the same database.

This introduces problems: a single point of failure, and the inability to work offline.

Distributed Version Control Systems

To overcome the problems related to centralization, distributed VCSs (DVCSs) were invented: they keep a complete copy of the database in every working directory.

This is actually the simplest and most powerful implementation of any VCS.

Git Basics - The Git Workflow

The simplest use of Git:

- Modify files in your working directory.
- Stage the files, adding snapshots of them to your staging area.
- Commit: take the files in the staging area and store that snapshot permanently in your Git directory.

Git Basics - The Three States

The three basic states of files in your Git repository: modified (in the working directory), staged (in the staging area), and committed (stored in the Git directory).

Git Basics - Commits

Each commit in the Git directory holds a snapshot of the files that were staged and thus went into that commit, along with author information.

Each and every commit can always be looked at and retrieved.

Git Basics - File Status Lifecycle

Files in your working directory can be in four different states in relation to the current commit: untracked, unmodified, modified, and staged.

Git Basics - Working with remotes

In Git, all remotes are equal.

A remote in Git is nothing more than a link to another Git directory.

Git Basics - Working with remotes

The easiest commands to get started working with a remote are

- clone: cloning a remote makes a complete local copy.
- pull: get changes from a remote.
- push: send changes to a remote.

Fear not! We are starting to get into more advanced topics. So let’s look at some examples.

Git Basics - Advantages

Basic advantages of using Git:

- Nearly every operation is local.
- Committed snapshots are always kept.
- Strong support for non-linear development.

Hands-on - First-Time Git Setup

Before using Git for the first time:

Pick your identity

$ git config --global user.name "John Doe"
$ git config --global user.email [email protected]

Check your settings

$ git config --list

Get help

$ git help <verb>

Hands-on - Getting started with a bare remote server

Using a Git server (i.e., a bare repository with no working directory) is the Git analogue of a regular centralized VCS.
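A minimal sketch of creating such a bare repository on a server (not from the slides; host and path are placeholders):

$ ssh user@server
$ git init --bare /srv/git/project.git

Clients can then use user@server:/srv/git/project.git as the remote address.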

Hands-on - Getting started with a remote server

When the remote server is set up with an initialized Git directory, you can simply clone the repository:

Cloning a remote repository

$ git clone <repository>

You will then get a complete local copy of that repository, which you can edit.

Hands-on - Getting started with a remote server

With your local working copy you can make any changes to the files in your working directory that you like. When satisfied with your changes, you add any modified or new files to the staging area using add:

Adding files to the staging area

$ git add <filepattern>

Hands-on - Getting started with a remote server

Finally, to commit the files in the staging area, you run commit, supplying a commit message.

Committing the staging area to the repository

$ git commit -m <msg>

Note that so far everything is happening locally in your working directory.

Hands-on - Getting started with a remote server

To share your commits with the remote, you invoke the push command:

Pushing local commits to the remote

$ git push

To receive changes that other people have pushed to the remote server, you can use the pull command:

Pulling remote commits to the local working directory

$ git pull

And that’s it.

Hands-on - Summary

Summary of a minimal Git workflow (see the example session below):

- clone the remote repository
- add your changes to the staging area
- commit those changes to the git directory
- push your changes to the remote repository
- pull remote changes to your local working directory
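Putting the pieces together, a hypothetical end-to-end session (the repository address and file name are placeholders):

$ git clone user@server:/srv/git/project.git
$ cd project
$ nano heat_solver.c            # edit with any editor
$ git add heat_solver.c
$ git commit -m "Add OpenMP parallel loop to heat solver"
$ git push
$ git pull                      # pick up collaborators' commits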

More advanced topics

Git is a powerful and flexible DVCS. Some very useful, but a bit more advanced, features include (sketched below):

- Branching
- Merging
- Tagging
- Rebasing
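For instance, a basic branch-and-merge cycle might look like this (the branch and tag names are illustrative):

$ git branch feature          # create a branch
$ git checkout feature        # switch to it
$ # ...edit, add, commit as before...
$ git checkout master         # switch back
$ git merge feature           # merge the branch into master
$ git tag v1.0                # mark a release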

References

Some good sources of information on Git:

- Git Community Book - http://book.git-scm.com/
- Pro Git - http://progit.org/
- Git Reference - http://gitref.org/
- GitHub - http://github.com/
- Git from the bottom up - http://ftp.newartisans.com/pub/git.from.bottom.up.pdf
- Understanding Git Conceptually - http://www.eecs.harvard.edu/~cduan/technical/git/
- Git Immersion - http://gitimmersion.com/

Applications

GUIs for Git:

- GitX (MacOS) - http://gitx.frim.nl/
- Giggle (Linux) - http://live.gnome.org/giggle

Parallel machines and programming models

Parallel architectures (Flynn’s taxonomy)

Characterization of architectures according to Flynn:

SISD: Single instruction, single data. This is the conventional sequential model.

SIMD: Single instruction, multiple data. Multiple processing units with identical instructions, each one working on different data. Useful when many completely identical tasks are needed.

MIMD: Multiple instructions, multiple data. Multiple processing units with separate (but often similar) instructions and data/memory access (shared or distributed). We will mainly use this approach.

MISD: Multiple instructions, single data. Not practical.

Programming model must reflect architecture

Example: inner product between two (very long) vectors, a^T b:

- Where are a and b stored? In a single memory or distributed?
- What work should be done by which processor?
- How do the processors coordinate their result?

Shared memory programming model

- The program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: if more than one processor core reads/writes to a memory location, we have a race condition.

The programming model must manage the different threads and avoid race conditions.

OpenMP: Open Multi-Processing is the application programming interface (API) that supports shared memory parallelism: www.openmp.org
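A minimal OpenMP sketch (not from the slides) of the inner product a^T b from the previous slide: the threads split the loop, and the reduction clause avoids the race condition that a plain shared update of sum would cause. Compile, e.g., with gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;                /* shared result */
    /* i is private to each thread; reduction gives every thread a
       private partial sum and combines them safely at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("a^T b = %f (threads available: %d)\n",
           sum, omp_get_max_threads());
    return 0;
}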

Distributed memory programming model

- The program is run by a collection of named processes, fixed at start-up.
- Local address space; no shared data.
- Logically shared data is distributed (e.g., every processor only has direct access to a chunk of rows of a matrix).
- Explicit communication through send/receive pairs.

The programming model must accommodate communication.

MPI: Message Passing Interface (different implementations: LAM, Open-MPI, Mpich, Mvapich), http://www.mpi-forum.org/
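The same inner product as a minimal MPI sketch (not from the slides; the local chunk size is illustrative): each process owns a piece of the vectors, computes a local partial sum, and the partial sums are combined by an explicit collective. Run, e.g., with mpirun -np 4 ./a.out.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { NLOCAL = 250000 };        /* chunk owned by this process */
    static double a[NLOCAL], b[NLOCAL];
    for (int i = 0; i < NLOCAL; i++) { a[i] = 1.0; b[i] = 2.0; }

    double local = 0.0, global = 0.0;
    for (int i = 0; i < NLOCAL; i++)
        local += a[i] * b[i];

    /* combine the partial sums across all processes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("a^T b = %f (computed on %d processes)\n", global, size);
    MPI_Finalize();
    return 0;
}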

Hybrid distributed/shared programming model

- A pure MPI approach splits the memory of a multicore processor into independent memory pieces and uses MPI to exchange information between them.
- A hybrid approach uses MPI across processors, and OpenMP for processor cores that have access to the same memory (see the sketch below).
- A similar hybrid approach is also used for hybrid architectures, i.e., computers that contain CPUs and accelerators (GPGPUs, MICs).
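A hedged sketch of what hybrid code can look like, combining the two examples above (the thread-support level and chunk size are illustrative choices):

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    /* request MPI that is safe to combine with OpenMP threads */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    enum { NLOCAL = 250000 };
    static double a[NLOCAL], b[NLOCAL];
    for (int i = 0; i < NLOCAL; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* OpenMP threads share this process's chunk ... */
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < NLOCAL; i++)
        local += a[i] * b[i];

    /* ... and MPI combines the chunks across nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}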

Other parallel programming approaches

- Grid computing: loosely coupled problems; the most famous example was SETI@Home.
- MapReduce: introduced by Google; targets large data sets with parallel, distributed algorithms on a cluster.
- WebCL
- Pthreads
- CUDA
- Cilk
- ...

Shared memory parallelism–OpenMP

Shared memory programming model

- The program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: if more than one processor core reads/writes to a memory location, we have a race condition.

Only one process is running, which can fork into shared memory threads.

Threads versus processes

- A process is an independent execution unit that contains its own state information (pointers to instructions and stack). One process can contain several threads.
- Threads within a process share the same address space and communicate directly using shared variables. Each thread has a separate stack, but they share heap memory.
- Stack memory: used for temporarily storing data; fast; last-in-first-out principle. Examples: int a=2; double b=2.11; etc.; no deallocation necessary; small size; static.
- Heap memory: not managed automatically; must be manually allocated/de-allocated/re-allocated; slower; larger.
- Using several threads can also be useful on a single processor (“multithreading”), depending on the memory latency.
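A small C illustration (not from the slides) of the stack/heap distinction described above:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int a = 2;            /* stack: automatic, freed when scope ends */
    double b = 2.11;      /* stack */

    /* heap: size chosen at run time, must be freed manually */
    double *v = malloc(1000000 * sizeof(double));
    if (v == NULL) return 1;
    v[0] = a + b;
    printf("v[0] = %f\n", v[0]);

    free(v);              /* forgetting this leaks heap memory */
    return 0;
}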

Shared memory programming

- POSIX Threads (Pthreads) library; more intrusive than OpenMP.
- PGAS languages: partitioned global address space; logically partitioned, but can be programmed like a global memory address space (communication is taken care of in the background).
- OpenMP: Open Multi-Processing is a light-weight application programming interface (API) that supports shared memory parallelism.

Shared memory programming—Literature

OpenMP standard online: www.openmp.org

Very useful online course: www.youtube.com/user/OpenMPARB/

Recommended reading: Chapter 6 and Chapter 7 in [the two textbooks whose covers are shown on the slide].