Advanced Topics in Numerical Analysis: High Performance Computing
MATH-GA 2012.001 & CSCI-GA 2945.001
Georg Stadler, Courant Institute, [email protected]
Spring 2017, Thursday, 5:10–7:00PM, WWH #512
Feb. 24, 2017
1 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
2 / 48
Organization issues
- Please register for an XSEDE account so that Bill can add you to the Stampede allocation (see his post on Piazza). You are welcome to do that even if you only audit the course.
- Please fill out the poll for the makeup class on either March 6 or 7.
- Next week, my colleague Marsha Berger will fill in for me.
- And yes, there will be new homework at some point. . .
3 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
4 / 48
Memory hierarchies
On my MacBook Pro: 32KB L1 cache, 256KB L2 cache, 3MB L3 cache, 8GB RAM
CPU: O(1 ns), L2/L3: O(10 ns), RAM: O(100 ns), disk: O(10 ms)
5 / 48
Memory hierarchies
Important terms:
- latency: time it takes to load/write data from/to a specific location in RAM to/from the CPU registers (in seconds)
- bandwidth: rate at which data can be read/written (for large data); in bytes/second
- cache hit: required data is available in cache ⇒ fast access
- cache miss: required data is not in cache and must be loaded from main memory (RAM) ⇒ slow access

Computer architecture is complicated. We need a basic performance model.
- The processor needs to be “fed” with data to work on.
- Memory access is slow; memory hierarchies help.
6 / 48
Memory hierarchy
Simple model

1. Only consider two levels in the hierarchy: fast (cache) and slow (RAM) memory
2. All data is initially in slow memory
3. Simplifications:
   - ignore that memory access and arithmetic operations can happen at the same time
   - assume the time for access to fast memory is 0
4. Computational intensity: flops per slow memory access

   q = f/m, where f = #flops and m = #slow memory accesses

The computational intensity should be as large as possible.
7 / 48
Memory hierarchy
Example: Matrix-matrix multiply
Comparison between naive and blocked (optimized) matrix-matrix multiplication for different matrix sizes: different algorithms can increase the computational intensity significantly.
BLAS: optimized Basic Linear Algebra Subprograms
- Temporal and spatial locality is key for fast performance.
- Since arithmetic is cheap compared to memory access, one can consider performing extra flops if doing so reduces memory accesses.
- In distributed-memory parallel computations, the memory hierarchy extends to data stored on other processors, which is only available through communication over the network.
8 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
9 / 48
Levels of parallelism
- Parallelism at the bit level (64-bit operations)
- Parallelism by pipelining (overlapping the execution of multiple instructions); “assembly line” parallelism, instruction-level parallelism (ILP); several operations per cycle
- Multiple-functional-unit parallelism: ALUs (arithmetic logic units), FPUs (floating point units), load/store memory units, . . .

All of the above assume a single sequential control flow.

- Process/thread-level parallelism: independent processor cores, multicore processors; parallel control flow
10 / 48
Amdahl’s law
Is there enough parallelism in my problem?

Suppose only part of the application is parallel. Amdahl’s law:
- Let s be the fraction of work done sequentially, and (1 − s) the fraction done in parallel.
- Let p be the number of parallel processors (cores).

Speedup:

  time(1 proc) / time(p procs) ≤ 1 / (s + (1 − s)/p) ≤ 1/s

Thus: performance is limited by the sequential part.
Source: Wikipedia
11 / 48
Load (im)balance in parallel computations
In parallel computations, the work should be distributed evenly across workers/processors.
- Load imbalance: idle time due to insufficient parallelism or unequally sized tasks
- Initial/static load balancing: distribution of work at the beginning of the computation
- Dynamic load balancing: the work load needs to be re-balanced during the computation. Imbalance can occur, e.g., due to
  - adaptivity (mesh refinement)
  - completely unstructured problems
12 / 48
Parallel scalability
Strong and weak scaling/speedup

Strong scalability:
[Figure: fixed total work; plot of time = 1/efficiency vs. number of processors (1, 2, 4, 8, 16, 32), in %]

Weak scalability:
[Figure: work grows with processors; plot of parallel efficiency vs. number of processors (1, 2, 4, 8, 16, 32), in %]

13 / 48
Locality and parallelism
Locality of data to the processing unit is critical in general, and even more so on parallel computers.
- Memory hierarchies are even deeper on parallel computers.
- In parallel computing, data often must be communicated through the network, which is slow. Again, locality is key.
- Computation should mostly be on local (cache-local, in DRAM, processor-local, disk-local) data.
14 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
15 / 48
Why Use Version Control?
Slides adapted from Andreas Skielboe
A Version Control System (VCS) is an integrated, fool-proof framework for
- Backup and restore
- Short- and long-term undo
- Tracking changes
- Synchronization
- Collaborating
- Sandboxing
... with minimal overhead.
16 / 48
Local Version Control Systems
Conventional version control systems provide some of these features by keeping a local database of all changes made to files.
Any file can be recreated by retrieving changes from the database and patching them together.
17 / 48
Centralized Version Control Systems
To enable synchronization and collaboration, the database is stored on a central VCS server, where everyone works in the same database.
This introduces problems: a single point of failure and the inability to work offline.
18 / 48
Distributed Version Control Systems
To overcome the problems of centralization, distributed VCSs (DVCSs) were invented. They keep a complete copy of the database in every working directory.
This is actually the simplest and most powerful implementation of any VCS.
19 / 48
Git Basics - The Git Workflow
The simplest use of Git:
- Modify files in your working directory.
- Stage the files, adding snapshots of them to your staging area.
- Commit: take the files in the staging area and store that snapshot permanently in your Git directory.
20 / 48
Git Basics - The Three States
The three basic states of files in your Git repository:
21 / 48
Git Basics - Commits
Each commit in the Git directory holds a snapshot of the files that were staged and thus went into that commit, along with author information.
Each and every commit can always be inspected and retrieved.
22 / 48
Git Basics - File Status Lifecycle
Files in your working directory can be in four different states in relation to the current commit.
23 / 48
Git Basics - Working with remotes
In Git, all remotes are equal.
A remote in Git is nothing more than a link to another Git directory.
24 / 48
Git Basics - Working with remotes
The easiest commands to get started working with a remote are
- clone: cloning a remote makes a complete local copy.
- pull: gets changes from a remote.
- push: sends changes to a remote.

Fear not! We are only just getting into more advanced topics. So let's look at some examples.
25 / 48
Git Basics - Advantages
Basic advantages of using Git:
- Nearly every operation is local.
- Committed snapshots are always kept.
- Strong support for non-linear development.
26 / 48
Hands-on - First-Time Git Setup
Before using Git for the first time:
Pick your identity
$ git config --global user.name "John Doe"
$ git config --global user.email [email protected]
Check your settings
$ git config --list
Get help
$ git help <verb>
27 / 48
Hands-on - Getting started with a bare remote server
Using a Git server (i.e., a bare repository with no working directory) is the analogue in Git of a regular centralized VCS.
28 / 48
Hands-on - Getting started with remote server
When the remote server is set up with an initialized Git directory, you can simply clone the repository:
Cloning a remote repository
$ git clone <repository>
You will then get a complete local copy of that repository, which you can edit.
29 / 48
Hands-on - Getting started with remote server
With your local working copy you can make any changes to the files in your working directory as you like. When satisfied with your changes, you add any modified or new files to the staging area using add:
Adding files to the staging area
$ git add <filepattern>
30 / 48
Hands-on - Getting started with remote server
Finally, to commit the files in the staging area, you run commit, supplying a commit message.
Committing staging area to the repository
$ git commit -m <msg>
Note that so far everything is happening locally in your working directory.
31 / 48
Hands-on - Getting started with remote server
To share your commits with the remote, you invoke the push command:
Pushing local commits to the remote
$ git push
To receive changes that other people have pushed to the remote server, you can use the pull command:
Pulling remote commits to the local working directory
$ git pull
And that's it.
32 / 48
Hands-on - Summary
Summary of a minimal Git workflow:
- clone the remote repository
- add your changes to the staging area
- commit those changes to the Git directory
- push your changes to the remote repository
- pull remote changes to your local working directory
33 / 48
More advanced topics
Git is a powerful and flexible DVCS. Some very useful, but a bit more advanced, features include:
- Branching
- Merging
- Tagging
- Rebasing
34 / 48
References
Some good sources of Git information:
- Git Community Book - http://book.git-scm.com/
- Pro Git - http://progit.org/
- Git Reference - http://gitref.org/
- GitHub - http://github.com/
- Git from the bottom up - http://ftp.newartisans.com/pub/git.from.bottom.up.pdf
- Understanding Git Conceptually - http://www.eecs.harvard.edu/~cduan/technical/git/
- Git Immersion - http://gitimmersion.com/
35 / 48
Applications
GUIs for Git:
- GitX (MacOS) - http://gitx.frim.nl/
- Giggle (Linux) - http://live.gnome.org/giggle
36 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
37 / 48
Parallel architectures (Flynn’s taxonomy)
Characterization of architectures according to Flynn:
SISD: Single instruction, single data. This is the conventional sequential model.
SIMD: Single instruction, multiple data. Multiple processing units with identical instructions, each working on different data. Useful when many completely identical tasks are needed.
MIMD: Multiple instructions, multiple data. Multiple processing units with separate (but often similar) instructions and data/memory access (shared or distributed). We will mainly use this approach.
MISD: Multiple instructions, single data. Not practical.
38 / 48
Programming model must reflect architecture
Example: inner product between two (very long) vectors, a^T b:
- Where are a and b stored? In a single memory or distributed?
- What work should be done by which processor?
- How do the processors coordinate their result?
39 / 48
Shared memory programming model
- The program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: more than one processor core reads/writes to a memory location: a race condition.

The programming model must manage the different threads and avoid race conditions.
OpenMP: Open Multi-Processing is the application programming interface (API) that supports shared-memory parallelism: www.openmp.org
40 / 48
Distributed memory programming model
- The program is run by a collection of named processes, fixed at start-up.
- Local address space; no shared data.
- Logically shared data is distributed (e.g., every processor has direct access only to a chunk of rows of a matrix).
- Explicit communication through send/receive pairs.

The programming model must accommodate communication.
MPI: Message Passing Interface (different implementations: LAM, Open-MPI, MPICH, MVAPICH), http://www.mpi-forum.org/
41 / 48
Hybrid distributed/shared programming model
- A pure MPI approach splits the memory of a multicore processor into independent memory pieces and uses MPI to exchange information between them.
- A hybrid approach uses MPI across processors, and OpenMP for processor cores that have access to the same memory.
- A similar hybrid approach is also used for hybrid architectures, i.e., computers that contain CPUs and accelerators (GPGPUs, MICs).
42 / 48
Other parallel programming approaches
- Grid computing: loosely coupled problems; the most famous example was SETI@home.
- MapReduce: introduced by Google; targets large data sets with parallel, distributed algorithms on a cluster.
- WebCL
- Pthreads
- CUDA
- Cilk
- ...
43 / 48
Outline
Organization issues
Last class summary
Why writing fast code isn’t easy
Git—repository systems
Parallel machines and programming models
Shared memory parallelism–OpenMP
44 / 48
Shared memory programming model
- The program is a collection of control threads that are created dynamically.
- Each thread has private and shared variables.
- Threads can exchange data by reading/writing shared variables.
- Danger: more than one processor core reads/writes to a memory location: a race condition.

Only one process is running; it can fork into shared-memory threads.
45 / 48
Threads versus processes
- A process is an independent execution unit that contains its own state information (pointers to instructions and stack). One process can contain several threads.
- Threads within a process share the same address space and communicate directly using shared variables. Each thread has a separate stack, but the heap memory is shared.
- Stack memory: used for temporarily storing data; fast; last-in-first-out principle. Examples: int a = 2; double b = 2.11; no deallocation necessary; small size; static.
- Heap memory: not managed automatically; manually allocate/de-allocate/re-allocate; slower; larger.
- Using several threads can also be useful on a single processor (“multithreading”), depending on the memory latency.
46 / 48
Shared memory programming
- POSIX Threads (Pthreads) library; more intrusive than OpenMP.
- PGAS languages: partitioned global address space; logically partitioned, but can be programmed like a global memory address space (communication is taken care of in the background).
- OpenMP: Open Multi-Processing is a light-weight application programming interface (API) that supports shared-memory parallelism.
47 / 48
Shared memory programming—Literature
OpenMP standard online: www.openmp.org
Very useful online course: www.youtube.com/user/OpenMPARB/
Recommended reading: Chapter 6 and Chapter 7 of the textbooks shown on the slide.
48 / 48