multicore architecture and hybrid...

50
Multicore Architecture and Hybrid Programming Rebecca Hartman-Baker Oak Ridge National Laboratory [email protected] © 2004-2009 Rebecca Hartman-Baker. Reproduction permitted for non-commercial, educational use only.

Upload: others

Post on 23-Jan-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

Multicore Architecture and Hybrid Programming

Rebecca Hartman-Baker Oak Ridge National Laboratory

[email protected]

© 2004-2009 Rebecca Hartman-Baker. Reproduction permitted for non-commercial, educational use only.

Page 2: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

2

Outline

I.  Computer Architecture II.  Hybrid Programming

Page 3: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

3

I. COMPUTER ARCHITECTURE Mare Nostrum, installed in Chapel Torre Girona, Barcelona Supercomputing Center. By courtesy of Barcelona Supercomputing Center -- http://www.bsc.es/

Page 4: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

4

I. Computer Architecture 101

Source: http://en.wikipedia.org/wiki/Image:Computer_abstraction_layers.svg (author unknown)

Page 5: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

5

Computer Architecture 101

• Processors • Memory

–  Memory Hierarchy –  TLB

•  Interconnects • Glossary

Page 6: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

6

Computer Architecture 101: Processors

• CPU performs 4 basic operations: –  Fetch –  Decode –  Execute –  Writeback

Source: http://en.wikipedia.org/wiki/Image:CPU_block_diagram.svg

Page 7: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

7

CPU Operations

•  Fetch –  Retrieve instruction from program memory –  Location in memory tracked by program counter (PC) –  Instruction retrieval sped up by caching and pipelining

•  Decode –  Interpret instruction by breaking into meaningful parts, e.g., opcode,

operands

•  Execute –  Connect to portions of CPU to perform operation, e.g., connect to

arithmetic logic unit (ALU) to perform addition • Writeback

–  Write result of execution to memory

Page 8: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

8

Computer Architecture 101: Memory

• Hierarchy of memory –  Fast-access memory: small (expensive) –  Slower-access memory: large (less expensive)

• Cache: fast-access memory where frequently used data stored –  Reduces average access time –  Works because typically, applications have locality of reference –  Cache in XT4/5 also hierarchical

•  TLB: Translation lookaside buffer –  Used by memory management hardware to improve speed of

virtual address translation

Page 9: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

9

Cache Associativity

• Where to look in cache memory for copy of main memory location? –  Direct-Mapped/ 1-way Associative: only one location in cache for

each main memory location –  Fully Associative: can be stored anywhere in cache –  2-way Associative: two possible locations in cache –  N-way Associative: N possible locations in cache

• Doubling associativity ( 1 2, 2 4) has same effect on hit rate as doubling cache size

•  Increasing beyond 4 does not substantially improve hit rate; higher associativity done for other reasons

Page 10: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

10

Cache Associativity: Illustration

Main Memory

0123456789...

Cache Memory

0123

Main Memory

0123456789...

Cache Memory

0123

Direct-Mapped Cache

2-Way Associative Cache

Page 11: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

11

Computer Architecture 101: Interconnects

• Connect nodes of machine to one another

• Methods of interconnecting –  Fiber + switches and routers –  Directly connecting

•  Topology –  Torus –  Hypercube –  Butterfly –  Tree

2-D Torus Hypercube

Page 12: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

12

Computer Architecture 101: Glossary

• SSE (Streaming SIMD Extensions): instruction set extension to x86 architecture, allowing CPU to work on multiple instructions in single clock cycle

• DDR2 (Double Data Rate 2): synchronous dynamic random access memory, operates twice as fast as DDR1 –  DDR2-xyz: performs xyz million data transfers/second

• Dcache: cache devoted to data storage •  Icache: cache devoted to instruction storage • STREAM: data flow

Page 13: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

13

Facts and Figures

Kraken  (Cray  XT5)  

Athena  (Cray  XT4)  

Ranger  (Sun  Constella:on)  

Compute  Nodes   16,512   4512   3936  

Node   2.6  GHz  AMD  Istanbul  Dual  Six-­‐Core  

2.3  GHz  AMD  Opteron  Quad  Core  

2.3  GHz  AMD  Opteron  Quad  Core  x  4  

Memory   16  GB/node  DDR2-­‐800  

8  GB/node  DDR2-­‐667/DDR2-­‐800  

32  GB/node  DDR2-­‐667  

Network   Cray  SeaStar  2,  3-­‐D  torus  

Cray  SeaStar  2,  3-­‐D  torus  

Infiniband  switch,  fat  tree  

Peak   1.03  PF   166  TF   579  TF  

Page 14: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

14

XT4/5 Architecture

• Hardware –  Processors –  Memory

•  Memory Hierarchy •  TLB

–  System architecture –  Interconnects

• Software –  Operating System Integration –  CNL vs Linux

Page 15: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

15

Quad-Core Architecture

Page 16: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

16

Page 17: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

17

Quad Core Cache Hierarchy

Core 1

Cache Control

64 KB

512 KB

2 MB

Core 2

Cache Control

64 KB

512 KB

Core 3

Cache Control

64 KB

512 KB

Core 4

Cache Control

64 KB

512 KB

Page 18: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

18

L1 Cache

• Dedicated •  2-way associativity •  8 banks •  2 x 128-bit loads/cycle

Core 1

Cache Control

64 KB

512 KB

2 MB

Core 2

Cache Control

64 KB

512 KB

Core 3

Cache Control

64 KB

512 KB

Core 4

Cache Control

64 KB

512 KB

Page 19: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

19

L2 Cache

• Dedicated •  16-way associativity

Core 1

Cache Control

64 KB

512 KB

2 MB

Core 2

Cache Control

64 KB

512 KB

Core 3

Cache Control

64 KB

512 KB

Core 4

Cache Control

64 KB

512 KB

Page 20: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

20

L3 Cache

• Shared • Sharing-aware replacement

policy

Core 1

Cache Control

64 KB

512 KB

2 MB

Core 2

Cache Control

64 KB

512 KB

Core 3

Cache Control

64 KB

512 KB

Core 4

Cache Control

64 KB

512 KB

Core 1

Cache Control

64 KB

512 KB

2 MB

Core 2

Cache Control

64 KB

512 KB

Core 3

Cache Control

64 KB

512 KB

Core 4

Cache Control

64 KB

512 KB…

Page 21: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

21

Cray XT4 Architecture

•  XT4 is 4th generation Cray MPP •  Service nodes run full Linux •  Compute nodes run Compute Node

Linux (CNL)

Page 22: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

22

Cray XT4 Architecture

•  2- or 4-way SMP •  > 35 Gflops/node •  Up to 8 GB/node •  OpenMP Support within Socket

Page 23: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

23

Cray SeaStar2 Architecture

•  Router connects to 6 neighbors in 3-D torus –  Peak bidirectional BW 7.6 GB/s;

sustained 6 GB/s –  Reliable link protocol with error

correction and retransmission •  Communications Engine:

DMA Engine + PPC 440 –  Together, perform messaging

tasks so AMD processor can focus on computing

•  DMA Engine and OS together minimize latency with path directly from app to communication hardware (without traps and interrupts)

HyperTransport Interface

Memory

PowerPC 440 Processor

DMA Engine 6-Port

Router

Blade Control

Processor Interface

Page 24: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

24

Cray XT5 Architecture

•  8-way SMP •  > 70 Gflops/node •  Up to 32 GB shared memory/node •  OpenMP support

Page 25: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

25

Software Architecture

•  CLE microkernel on compute nodes

•  Full-featured Linux on service nodes

•  Software architecture eliminates jitter and enables reproducible runtimes

•  Even large machines can reboot in < 30 mins, including filesystem

Page 26: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

26

Software Architecture

• Compute PE (processing element): used for computation only; users cannot directly access compute nodes

• Service PEs: run full Linux –  Login: users access these nodes to develop code and submit jobs,

function like normal Linux box –  Network: provide high-speed connectivity with other systems –  System: run global system services such as system database –  I/O: provide connectivity to GPFS (global parallel file system)

Page 27: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

27

CLE vs Linux

• CLE (Cray Linux Environment) contains subset of Linux features

• Minimizes system overhead because little between application and bare hardware

Page 28: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

28

Resources: Computer Architecture 101

• Wikipedia articles on computer architecture: http://en.wikipedia.org/wiki/Computer_architecture , http://en.wikipedia.org/wiki/CPU , http://en.wikipedia.org/wiki/CPU_cache , http://en.wikipedia.org/wiki/DDR2_SDRAM , http://en.wikipedia.org/wiki/Microarchitecture , http://en.wikipedia.org/wiki/SSE2 , http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

• Heath, Michael T. (2007) Notes for CS 554, Parallel Numerical Algorithms, http://www.cse.uiuc.edu/courses/cs554/notes/index.html

Page 29: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

29

Resources: Cray XT4 Architecture

•  NCCS machines –  Jaguar: http://www.nccs.gov/computing-resources/jaguar/ –  Eugene: http://www.nccs.gov/computing-resources/eugene/ –  Jaguarpf: http://www.nccs.gov/jaguar/

•  AMD architecture –  Waldecker, Brian (2008) Quad Core AMD Opteron Processor

Overview, available at http://www.nccs.gov/wp-content/uploads/2008/04/amd_craywkshp_apr2008.pdf

–  Larkin, Jeff (2008) Optimizations for the AMD Core, available at http://www.nccs.gov/wp-content/uploads/2008/04/optimization1.pdf

•  XT4 Architecture –  Hartman-Baker, Rebecca (2008) XT4 Architecture and Software,

available at http://www.nccs.gov/wp-content/training/2008_users_meeting/4-17-08/using-xt44-17-08.pdf

Page 30: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

30

II. HYBRID PROGRAMMING Hybrid Car. Source: http://static.howstuffworks.com/gif/hybrid-car-hyper.jpg

Page 31: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

31

II. Hybrid Programming

• Motivation • Considerations • MPI threading support • Designing hybrid algorithms • Examples

Page 32: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

32

Motivation

• Multicore architectures are here to stay • Macro scale: distributed memory architecture, suitable for

MPI • Micro scale: each node contains multiple cores and shared

memory, suitable for OpenMP • Obvious solution: use MPI between nodes, and OpenMP

within nodes • Hybrid programming model

Page 33: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

33

Considerations

• Sounds great, Rebecca, but is hybrid programming always better? –  No, not always –  Especially if poorly programmed –  Depends also on suitability of architecture

•  Think of accelerator model –  in omp parallel region, use power of multicores; in serial region,

use only 1 processor –  If your code can exploit threaded parallelism “a lot”, then try hybrid

programming

Page 34: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

34

Considerations: Scalability

• As number of MPI processes increases, Amdahl’s Law hurts more and more –  Serial portion of code limits scalability –  A 0.1% serial code cannot have more than 1000x speedup, no

matter the number of processes –  May not matter on small process counts, but 1000x speedup vs.

serial on 224,256-core Jaguar? No!

• Hybrid programming can delay the inevitable –  On Jaguar, see speedup gains on up to 12,000 vs. 1000 cores (not

so embarrassing)

Page 35: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

35

Considerations: Changing to Hybrid Model

• MPI Hybrid –  Lots of fine-grained parallelism intrinsic in problem –  Data that could be shared rather than reproduced on each process

(e.g., constants, ghost cells)

• OpenMP Hybrid –  Want scalability to larger numbers of processors –  Coarse-grained tasks that could be distributed across machine

Page 36: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

36

Considerations: Developing Hybrid Algorithms

• Hybrid parallel programming model –  Are communication and computation discrete phases of algorithm? –  Can/do communication and computation overlap?

• Communication between threads –  Communicate only outside of parallel regions –  Assign a manager thread responsible for inter-process

communication –  Let some threads perform inter-process communication –  Let all threads communicate with other processes

Page 37: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

37

MPI Threading Support

• MPI-2 standard defines four threading support levels –  (0) MPI_THREAD_SINGLE only one thread allowed –  (1) MPI_THREAD_FUNNELED master thread is only thread

permitted to make MPI calls –  (2) MPI_THREAD_SERIALIZED all threads can make MPI

calls, but only one at a time –  (3) MPI_THREAD_MULTIPLE no restrictions –  (0.5) MPI calls not permitted inside parallel regions (returns MPI_THREAD_SINGLE) – this is MPI-1

Page 38: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

38

What Threading Model Does My Machine Support?

#include <mpi.h> #include <stdio.h> int main(int *argc, char **argv) {

MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

printf("Supports level %d of %d %d %d %d\n", provided, MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE);

MPI_Finalize(); return 0; }

Page 39: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

39

MPI_Init_Thread

• MPI_Init_thread(int required, int *supported) –  Use this instead of MPI_Init(…) –  required: the level of thread support you want –  supported: the level of thread support provided by

implementation (hopefully = required, but if not available, returns lowest level > required; failing that, largest level < required)

–  Using MPI_Init(…) is equivalent to required = MPI_THREAD_SINGLE

• MPI_Finalize() should be called by same thread that called MPI_Init_thread(…)

Page 40: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

40

Other Useful MPI Functions

• MPI_Is_thread_main(int *flag) –  Thread calls this to determine whether it is main thread

• MPI_Query_thread(int *provided) –  Thread calls to query level of thread support

Page 41: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

41

Supported Threading Models: Single

• Use single pragma #pragma omp parallel { #pragma omp barrier #pragma omp single { MPI_Xyz(…) } #pragma omp barrier }

Page 42: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

42

Supported Threading Models: Funneling

• Cray XT4/5 supports funneling (probably Ranger too?) • Use master pragma #pragma omp parallel { #pragma omp barrier #pragma omp master { MPI_Xyz(…) } #pragma omp barrier }

Page 43: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

43

What Threading Model Should I Use?

• Depends on the application!

Model   Advantages   Disadvantages  

Single   Portable:  every  MPI  implementaXon  supports  this  

Limited  flexibility  

Funneled   Simpler  to  program   Manager  thread  could  get  overloaded  

Serialized   Freedom  to  communicate   Risk  of  too  much  cross-­‐communicaXon  

MulXple   Completely  thread  safe   Limited  availability  

Page 44: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

44

Designing Hybrid Algorithms

•  Just because you can communicate thread-to-thread, doesn’t mean you should

•  Tradeoff between lumping messages together and sending individual messages –  Lumping messages together: one big message, one overhead –  Sending individual messages: less wait time (?)

• Programmability: performance will be great, when you finally get it working!

Page 45: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

45

Example: All-to-All Communication

•  4 quad-core nodes, 16 MPI processes

•  16 (# procs) x 15 (# msgs/proc) = 240 individual messages to be sent

• Communication volume: 16 x 15 x 1 = 240

•  4 quad-core nodes, 4 MPI processes with 4 threads each

•  4 (# procs) x 3 (#msgs/proc) = 12 agglomerated messages (4 x original length) to be sent

• Communication volume: 4 x 3 x 4 = 48 bandwidth

Page 46: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

46

Example: All-to-All Communication

16  MPI  Processes   4  MPI  Processes  +  4  OpenMP  threads  per  MPI  Process  

#  Messages   16  procs  x  15  msgs/proc  =  240  messages  

4  procs  x  3  msgs/proc  =  12  messages  

Message  size  (KB)  

1   4  

Bandwidth  required  

240  messages  x  1  size  =  240  KB   12  messages  x  4  size  =  48  KB  

4 quad-core nodes communicating 1 KB updates

Factor of 5 difference! Factor of 5 difference!

Page 47: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

47

Example: Mesh Partitioning

• Regular mesh of finite elements • When we partition mesh, need to communicate information

about (domain) adjacent cells to (computationally) remote neighbors

Page 48: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

48

Example: Mesh Partitioning

Page 49: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

49

Example: Mesh Partitioning Communication Patterns

Processor 1

Processor 2

Processor 3

0

0

0

1

1

1

2

2

2

Page 50: Multicore Architecture and Hybrid Programminghpcuniversity.org/vscse/petascale/downloads/multicore-arch10.pdf · – 2-way Associative: two possible locations in cache – ... •

50

Bibliography/Resources

•  von Alfthan, Sebastian, Introduction to Hybrid Programming, PRACE Summer School 2008, URL:http://www.prace-project.eu/hpc-training/prace-summer-school/hybridprogramming.pdf

•  Ye, Helen and Chris Ding, Hybrid OpenMP and MPI Programming and Tuning, Lawrence Berkeley National Laboratory, http://www.nersc.gov/nusers/services/training/classes/NUG/Jun04/NUG2004_yhe_hybrid.ppt

•  Zollweg, John, Hybrid Programming with OpenMP and MPI, Cornell University Center for Advanced Computing, http://www.cac.cornell.edu/education/Training/Intro/Hybrid-090529.pdf

•  MPI-2.0 Standard, Section 8.7, “MPI and Threads,” http://www.mpi-forum.org/docs/mpi-20-html/node162.htm#Node162